Formats
A format in neokapi is a paired reader and writer for a document type. The reader turns a file into a stream of Parts — translatable blocks and the surrounding structure — and the writer turns that stream back into a file. This read/process/write symmetry is what lets the same tools and flows operate on any format: by the time a tool sees a Block, it no longer matters whether it came from JSON, XLIFF, or DOCX. A format is the neokapi analogue of an Okapi filter.
Reading a file splits it into translatable blocks and
a non-translatable skeleton — every tag, key, attribute, and delimiter the
writer needs to reproduce the original structure. Run a file through
pseudo-translate below and compare the source with the round-tripped output:
only the leaf text changes, while the skeleton comes back byte-for-byte. This
runs the real kapi reader and writer in your browser via WebAssembly.
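The block/skeleton split can be illustrated with a toy reader and writer for a key=value file. This is a minimal sketch of the idea, not neokapi's actual API: the `Block` and `piece` types and the `read`, `write`, and `pseudo` functions are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Block is a hypothetical translatable unit: an identifier plus its text.
type Block struct {
	ID   string
	Text string
}

// piece is one element of the skeleton: either literal bytes that are
// copied through unchanged, or a slot that refers to a block by index.
type piece struct {
	literal string
	block   int // index into the block list, or -1 for a literal
}

// read splits key=value input into translatable blocks and a skeleton.
func read(src string) (blocks []Block, skeleton []piece) {
	for _, line := range strings.SplitAfter(src, "\n") {
		key, rest, ok := strings.Cut(line, "=")
		if strings.HasPrefix(line, "#") || !ok {
			// Comments and lines without '=' are pure skeleton.
			skeleton = append(skeleton, piece{literal: line, block: -1})
			continue
		}
		value := strings.TrimSuffix(rest, "\n")
		tail := rest[len(value):] // the trailing newline, if any
		blocks = append(blocks, Block{ID: key, Text: value})
		skeleton = append(skeleton,
			piece{literal: key + "=", block: -1},
			piece{block: len(blocks) - 1},
			piece{literal: tail, block: -1},
		)
	}
	return blocks, skeleton
}

// write merges block text back into the skeleton slots.
func write(blocks []Block, skeleton []piece) string {
	var out strings.Builder
	for _, p := range skeleton {
		if p.block >= 0 {
			out.WriteString(blocks[p.block].Text)
		} else {
			out.WriteString(p.literal)
		}
	}
	return out.String()
}

// pseudo is a stand-in transformation that only touches block text.
func pseudo(blocks []Block) []Block {
	out := make([]Block, len(blocks))
	for i, b := range blocks {
		out[i] = Block{ID: b.ID, Text: "[" + strings.ToUpper(b.Text) + "]"}
	}
	return out
}

func main() {
	src := "# greeting strings\nhello=Hello world\nbye=Goodbye\n"
	blocks, skeleton := read(src)
	fmt.Print(write(pseudo(blocks), skeleton))
}
```

Writing the untouched blocks back through the same skeleton reproduces the input exactly, which is the property the round-trip comparison relies on; only the text held in blocks can change.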
neokapi ships built-in readers and writers spanning several families:
- Localization — XLIFF 1.2 and 2.x, PO/POT, TMX, Qt TS, ICU MessageFormat, Trados TTX/TXML, and translation tables.
- Document & markup — HTML, XML (with configurable translatable elements), Markdown, wiki markup, TeX/LaTeX, DTD, RTF.
- Data & configuration — JSON and YAML (with regex- and key-path-based extraction rules), Java properties, CSV/TSV, fixed-width, PHP, and generic regex extraction.
- Office & desktop publishing — Office Open XML, OpenDocument, Adobe ICML/IDML, FrameMaker MIF, EPUB, PDF.
- Subtitles — SubRip (SRT), WebVTT, TTML/DFXP.
- Images — PNG and JPEG, as localizable assets.
- Plain text variants — paragraph, Moses, versified, and spliced-line text.
An image is read as a localizable asset: the picture itself is the unit a
workflow can replace with a per-locale variant. With the kapi-vision plugin
installed (and the ocr/layout options on), the reader also extracts in-image
text and document layout — regions, reading order, tables — turning a screenshot
or scanned page into structured, translatable content. The design, and the full
set of image-localization modes, are described in
AD-029.
Although it appears in the list above, PDF is read by Google's PDFium rather than a built-in reader: on the desktop and
CLI through the kapi-pdfium plugin, and in the browser through PDFium compiled
to WebAssembly. Beyond text, it recovers each fragment's position on the page
(geometry) and the document's structure — headings, paragraphs, and tables — from
the PDF's own tags where present and by geometric inference otherwise. You can try
it on your own files in the PDF Lab; the design is described in
AD-028.
Each format exposes its own configuration (extraction rules, segmentation, inline-code handling). The Format Reference is generated directly from the format registry rather than maintained by hand, so it always reflects the formats and parameters in the current build.
How kapi reads a file
The clearest way to see what a format reader does is to watch it parse a file.
Below, kapi reads an Android strings.xml resource and produces the content
model — the translatable blocks, their identifiers, and their source text. This
is the reader stage of the pipeline, with no transformation applied:
The same reader stage, pointed at a different format, produces blocks of the same shape. Here an XLIFF bilingual file resolves to the same kind of block stream:
The block shape is the same, but bilingual formats carry more. A monolingual format (JSON, YAML, properties) produces whole-block source content with no internal segment structure. A bilingual format (XLIFF, TMX) additionally populates stand-off segmentation and alignment overlays: the file's existing segment boundaries and source↔target pairings are recorded as overlays over the runs rather than baked into structure, so they survive a round-trip when present and are simply absent when a format doesn't define them. Tools and writers read those overlays; a format that emits none works at whole-block granularity.
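One way to picture a stand-off overlay is as a list of ranges that point into a block's source text instead of splitting it. The sketch below is an assumption-laden simplification: the `Block` and `Span` types and the `units` function are hypothetical, offsets index a plain string rather than neokapi's runs, and the alignment overlay is left out.

```go
package main

import "fmt"

// Span is a half-open byte range [Start, End) into a block's source text.
type Span struct{ Start, End int }

// Block is a hypothetical content-model unit. Segments is a stand-off
// overlay: it points into Source instead of splitting it, and stays nil
// for formats that define no segmentation.
type Block struct {
	ID       string
	Source   string
	Segments []Span
}

// units returns the pieces a tool would process: one per overlay segment,
// or the whole block when no overlay is present.
func units(b Block) []string {
	if len(b.Segments) == 0 {
		return []string{b.Source}
	}
	out := make([]string, 0, len(b.Segments))
	for _, s := range b.Segments {
		out = append(out, b.Source[s.Start:s.End])
	}
	return out
}

func main() {
	// A monolingual format: no overlay, so whole-block granularity.
	mono := Block{ID: "greeting", Source: "Hello. How are you?"}
	// A bilingual format: the file's own segment boundaries are recorded.
	bi := Block{
		ID:       "tu1",
		Source:   "Hello. How are you?",
		Segments: []Span{{0, 6}, {7, 19}},
	}
	fmt.Printf("%q\n", units(mono))
	fmt.Printf("%q\n", units(bi))
}
```

Because the overlay is separate from the text, dropping it leaves the block intact, which matches the description of segmentation being simply absent for formats that do not define it.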
Okapi bridge formats
With the Okapi bridge plugin installed, kapi can also dispatch to the Java-based filters of the Okapi Framework. This covers additional formats, such as DITA, that the native readers do not handle, without rewriting those filters in Go.
Format Detection
neokapi detects formats automatically using a cascade, trying each of the following in turn until one matches:
- Explicit MIME type (if provided)
- File extension mapping
- Magic bytes / content sniffing
You can override detection with the --format flag on any command.
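The cascade can be sketched as a chain of lookups that returns the first match. This is illustrative only and does not reflect neokapi's real registry: the format identifiers, the `byMIME` and `byExt` tables, and the `sniff` and `detect` functions are all hypothetical, and the `override` argument stands in for whatever --format supplies.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// Illustrative lookup tables with made-up format identifiers.
var byMIME = map[string]string{
	"text/html":        "html",
	"application/json": "json",
}

var byExt = map[string]string{
	".html": "html",
	".json": "json",
	".po":   "po",
}

// sniff inspects the leading bytes of the content.
func sniff(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte("%PDF-")):
		return "pdf" // PDF files begin with this header
	case bytes.HasPrefix(bytes.TrimSpace(head), []byte("{")):
		return "json" // crude heuristic, for illustration only
	}
	return ""
}

// detect tries each step in order and returns the first match.
// A non-empty override short-circuits detection entirely.
func detect(override, mime, path string, head []byte) string {
	if override != "" {
		return override
	}
	if id, ok := byMIME[mime]; ok {
		return id
	}
	if id, ok := byExt[strings.ToLower(filepath.Ext(path))]; ok {
		return id
	}
	return sniff(head)
}

func main() {
	fmt.Println(detect("", "", "notes.txt", []byte("%PDF-1.7")))
	fmt.Println(detect("", "text/html", "page.json", nil))
	fmt.Println(detect("po", "text/html", "page.json", nil))
}
```

Ordering the checks from most explicit to most heuristic means a caller-supplied hint always beats a guess, and content sniffing only runs when neither the MIME type nor the extension is recognized.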
Listing Formats
kapi formats
Use --mime or --ext to filter:
kapi formats --mime text/html
kapi formats --ext .docx
Interactive Format Reference
See the Format Reference page for interactive documentation of all formats with configurable parameters.