Formats

A format in neokapi is a paired reader and writer for a document type. The reader turns a file into a stream of Parts — translatable blocks and the surrounding structure — and the writer turns that stream back into a file. This read/process/write symmetry is what lets the same tools and flows operate on any format: by the time a tool sees a Block, it no longer matters whether it came from JSON, XLIFF, or DOCX. A format is the neokapi analogue of an Okapi filter.

See the skeleton preserved

Reading a file splits it into translatable blocks and a non-translatable skeleton — every tag, key, attribute, and delimiter the writer needs to reproduce the original structure. Run a file through pseudo-translate below and compare the source with the round-tripped output: only the leaf text changes, while the skeleton comes back byte-for-byte. This runs the real kapi reader and writer in your browser via WebAssembly.
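The block/skeleton split can be sketched on a toy format. The Go program below is a minimal illustration, not neokapi's reader or content model: the `Part` type and the `Read`, `Write`, and `PseudoTranslate` functions are hypothetical names, and the input is a simplified `key=value` file. It shows the property described above: the writer reproduces the source exactly when blocks are untouched, and only block text changes after pseudo-translation.

```go
package main

import (
	"fmt"
	"strings"
)

// Part is one item in the stream a reader emits: either a translatable
// block or a piece of skeleton that must be written back unchanged.
// Hypothetical type for illustration; not neokapi's content model.
type Part struct {
	IsBlock bool
	ID      string // entry key, set for blocks only
	Text    string // translatable text (block) or verbatim source (skeleton)
}

// Read splits "key=value" lines into skeleton and block parts.
// Comments, blank lines, keys, delimiters, and line endings stay in the skeleton.
func Read(src string) []Part {
	var parts []Part
	for _, line := range strings.SplitAfter(src, "\n") {
		if line == "" {
			continue
		}
		body := strings.TrimRight(line, "\r\n")
		eol := line[len(body):]
		key, value, found := strings.Cut(body, "=")
		text := strings.TrimLeft(value, " \t")
		if !found || text == "" || strings.HasPrefix(strings.TrimSpace(body), "#") {
			parts = append(parts, Part{Text: line}) // nothing translatable here
			continue
		}
		lead := body[:len(body)-len(text)] // key, "=", and any padding
		parts = append(parts,
			Part{Text: lead},
			Part{IsBlock: true, ID: strings.TrimSpace(key), Text: text},
		)
		if eol != "" {
			parts = append(parts, Part{Text: eol})
		}
	}
	return parts
}

// Write concatenates the parts in order, so untouched parts round-trip exactly.
func Write(parts []Part) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p.Text)
	}
	return b.String()
}

var accents = strings.NewReplacer("a", "á", "e", "é", "i", "í", "o", "ó", "u", "ú")

// PseudoTranslate returns a copy of the stream with block text altered
// and every skeleton part left as it was.
func PseudoTranslate(parts []Part) []Part {
	out := make([]Part, len(parts))
	copy(out, parts)
	for i := range out {
		if out[i].IsBlock {
			out[i].Text = "[" + accents.Replace(out[i].Text) + "]"
		}
	}
	return out
}

func main() {
	src := "# app strings\ngreeting = Hello\nfarewell=Goodbye\n"
	parts := Read(src)
	fmt.Println("round trip identical:", Write(parts) == src)
	fmt.Print(Write(PseudoTranslate(parts)))
}
```

Because the writer only concatenates parts, everything the reader did not classify as a block is returned verbatim, which is the same guarantee the skeleton provides in the lab above.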


neokapi ships built-in readers and writers spanning several families:

Localization — XLIFF 1.2 and 2.x, PO/POT, TMX, Qt TS, ICU MessageFormat, Trados TTX/TXML, and translation tables.
Document & markup — HTML, XML (with configurable translatable elements), Markdown, wiki markup, TeX/LaTeX, DTD, RTF.
Data & configuration — JSON and YAML (with regex- and key-path-based extraction rules), Java properties, CSV/TSV, fixed-width, PHP, and generic regex extraction.
Office & desktop publishing — Office Open XML, OpenDocument, Adobe ICML/IDML, FrameMaker MIF, EPUB, PDF.
Subtitles — SubRip (SRT), WebVTT, TTML/DFXP.
Images — PNG and JPEG, as localizable assets.
Plain text variants — paragraph, Moses, versified, and spliced-line text.

An image is read as a localizable asset: the picture itself is the unit a workflow can replace with a per-locale variant. With the kapi-vision plugin installed (and the ocr/layout options on), the reader also extracts in-image text and document layout — regions, reading order, tables — turning a screenshot or scanned page into structured, translatable content. The design, and the full set of image-localization modes, are described in AD-029.

PDF is read by Google's PDFium rather than a built-in reader: on the desktop and CLI through the kapi-pdfium plugin, and in the browser through PDFium compiled to WebAssembly. Beyond text, it recovers each fragment's position on the page (geometry) and the document's structure — headings, paragraphs, and tables — from the PDF's own tags where present and by geometric inference otherwise. You can try it on your own files in the PDF Lab; the design is described in AD-028.

Each format exposes its own configuration (extraction rules, segmentation, inline-code handling). The Format Reference is generated directly from the format registry rather than maintained by hand, so it always reflects the formats and parameters in the current build.

How kapi reads a file

The clearest way to see what a format reader does is to watch it parse a file. Below, kapi reads an Android strings.xml resource and produces the content model — the translatable blocks, their identifiers, and their source text. This is the reader stage of the pipeline, with no transformation applied:
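As a rough stand-in for that reader stage, the following Go sketch parses a small Android strings.xml document with the standard `encoding/xml` package and lists each block's identifier and source text. It is an assumption-laden illustration rather than kapi's actual reader: the `Block` type and `ReadStrings` function are hypothetical, and inline markup, plurals, and string arrays are ignored.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Block is a hypothetical, simplified translatable unit:
// an identifier plus its source text.
type Block struct {
	ID     string
	Source string
}

// resources mirrors the parts of strings.xml this sketch cares about.
type resources struct {
	Strings []struct {
		Name string `xml:"name,attr"`
		Text string `xml:",chardata"`
	} `xml:"string"`
}

// ReadStrings extracts one Block per <string> element.
// Only plain character data is kept; nested inline tags are not handled.
func ReadStrings(data []byte) ([]Block, error) {
	var res resources
	if err := xml.Unmarshal(data, &res); err != nil {
		return nil, err
	}
	blocks := make([]Block, 0, len(res.Strings))
	for _, s := range res.Strings {
		blocks = append(blocks, Block{ID: s.Name, Source: s.Text})
	}
	return blocks, nil
}

func main() {
	data := []byte(`<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="app_name">My App</string>
    <string name="greeting">Hello, world!</string>
</resources>`)

	blocks, err := ReadStrings(data)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, b := range blocks {
		fmt.Printf("%s: %q\n", b.ID, b.Source)
	}
}
```

Unlike a real format reader, this sketch discards the skeleton, so it can list blocks but cannot write the file back.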

The same parser, pointed at a different format, produces blocks of the same shape. Here an XLIFF bilingual file resolves to the same kind of block stream:

The block shape is the same, but bilingual formats carry more. A monolingual format (JSON, YAML, properties) produces whole-block source content with no internal segment structure. A bilingual format (XLIFF, TMX) additionally populates stand-off segmentation and alignment overlays: the file's existing segment boundaries and source↔target pairings are recorded as overlays over the runs rather than baked into structure, so they survive a round-trip when present and are simply absent when a format doesn't define them. Tools and writers read those overlays; a format that emits none works at whole-block granularity.
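One way to picture stand-off overlays is as ranges stored beside the text instead of inside it. The Go sketch below is a simplified assumption, not neokapi's data model: the `Span` and `Block` types and their field names are hypothetical, and offsets index plain strings where the real model indexes runs. It shows how a consumer can fall back to whole-block granularity when a format supplies no overlays.

```go
package main

import "fmt"

// Span is a half-open byte range [Start, End) over a block's text.
type Span struct{ Start, End int }

// Block keeps text and overlays separate. All names are hypothetical.
type Block struct {
	ID      string
	Source  string
	Target  string
	SrcSegs []Span   // segmentation overlay; nil when the format defines none
	TgtSegs []Span   // same, for the target side
	Align   [][2]int // alignment overlay: (source segment, target segment) index pairs
}

// cut returns the text under each span, or the whole text when there is no overlay.
func cut(text string, segs []Span) []string {
	if len(segs) == 0 {
		return []string{text}
	}
	out := make([]string, len(segs))
	for i, s := range segs {
		out[i] = text[s.Start:s.End]
	}
	return out
}

// SourceSegments lists source segments, or the whole block if unsegmented.
func (b Block) SourceSegments() []string { return cut(b.Source, b.SrcSegs) }

// AlignedPairs resolves the alignment overlay to (source, target) text pairs.
// It returns nil for blocks that carry no alignment.
func (b Block) AlignedPairs() [][2]string {
	src, tgt := cut(b.Source, b.SrcSegs), cut(b.Target, b.TgtSegs)
	var pairs [][2]string
	for _, a := range b.Align {
		pairs = append(pairs, [2]string{src[a[0]], tgt[a[1]]})
	}
	return pairs
}

func main() {
	mono := Block{ID: "greeting", Source: "Hello. Goodbye."}
	bi := Block{
		ID:      "u1",
		Source:  "Hello. Goodbye.",
		Target:  "Bonjour. Au revoir.",
		SrcSegs: []Span{{0, 6}, {7, 15}},
		TgtSegs: []Span{{0, 8}, {9, 19}},
		Align:   [][2]int{{0, 0}, {1, 1}},
	}
	fmt.Printf("%q\n", mono.SourceSegments())
	fmt.Printf("%q\n", bi.SourceSegments())
	fmt.Printf("%q\n", bi.AlignedPairs())
}
```

Keeping the ranges outside the text means the source string is identical in both cases; only the presence of the overlays differs.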

Okapi bridge formats

With the Okapi bridge plugin installed, kapi can also dispatch to the Java-based filters of the Okapi Framework. This covers additional formats, such as DITA, that the native readers do not handle, without rewriting those filters in Go.

Format Detection

neokapi detects formats automatically with a cascade, checking each of the following in order:

Explicit MIME type (if provided)
File extension mapping
Magic bytes / content sniffing

You can override detection with the --format flag on any command.
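The cascade can be sketched as a chain of lookups in which the first match wins. The Go example below is a hedged illustration only: the `Detect` function, the lookup tables, and the format identifiers it returns are assumptions made for this sketch and are not taken from neokapi's registry. The `override` argument stands in for a value passed with --format.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// Hypothetical lookup tables; a real registry would be far larger.
var byMIME = map[string]string{
	"text/html":        "html",
	"application/json": "json",
}

var byExt = map[string]string{
	".html": "html",
	".json": "json",
	".xlf":  "xliff",
	".po":   "po",
}

// sniff inspects the first bytes of the content. It returns "" when unsure.
func sniff(head []byte) string {
	trimmed := bytes.TrimSpace(head)
	switch {
	case bytes.HasPrefix(head, []byte("%PDF-")):
		return "pdf"
	case bytes.HasPrefix(head, []byte("\x89PNG\r\n\x1a\n")):
		return "png"
	case bytes.Contains(trimmed, []byte("<xliff")):
		return "xliff"
	case bytes.HasPrefix(trimmed, []byte("{")), bytes.HasPrefix(trimmed, []byte("[")):
		return "json"
	}
	return ""
}

// Detect applies the cascade: explicit override, then MIME type,
// then file extension, then content sniffing.
func Detect(override, mime, name string, head []byte) string {
	if override != "" {
		return override
	}
	if id, ok := byMIME[strings.ToLower(mime)]; ok {
		return id
	}
	if id, ok := byExt[strings.ToLower(filepath.Ext(name))]; ok {
		return id
	}
	return sniff(head)
}

func main() {
	fmt.Println(Detect("", "text/html", "page.json", nil))        // MIME type takes priority
	fmt.Println(Detect("", "", "messages.PO", nil))               // extension, case-insensitive
	fmt.Println(Detect("", "", "download", []byte("%PDF-1.7\n"))) // magic bytes
	fmt.Println(Detect("xliff", "text/html", "page.html", nil))   // explicit override
}
```

Ordering the checks from most to least explicit keeps cheap, unambiguous signals ahead of content sniffing, which is the only step that has to read the file.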

Listing Formats

To list the available formats, run:

kapi formats

Use --mime or --ext to filter:

kapi formats --mime text/html
kapi formats --ext .docx

Interactive Format Reference

See the Format Reference page for interactive documentation of all formats with configurable parameters.
