AD-005: Format System

Summary

Formats are pluggable readers and writers that convert between on-disk representations and the Part stream. The framework ships a broad set of built-in formats under core/formats/, each implementing DataFormatReader and DataFormatWriter on top of shared BaseFormatReader / BaseFormatWriter embeds. A single FormatRegistry exposes a factory-based lookup that serves native Go formats, plugin formats, and Okapi-bridge formats uniformly. Format detection cascades through MIME type, extension, magic bytes, and content sniffing. Roundtrip fidelity is supported by three interchangeable skeleton strategies.

Context

A localization framework must read a large variety of file formats and write them back with byte-exact fidelity — every newline, every entity reference, every attribute quote style. Formats vary widely in structure: linear text (plain text, Markdown), tree-structured markup (HTML, XML, DOCX), line-oriented key-value (Java properties, iOS strings), grid-based (CSV, XLSX), and translation-specific (XLIFF, TMX, TBX, Gettext).

At the same time, formats frequently contain embedded content in other formats (HTML inside JSON, Markdown inside CSV), and the reader/writer contract must accommodate this recursion without special cases.

Decision

Reader and writer interfaces

These interfaces implement the file source and sink binding in AD-026: Flow I/O Binding. Other bindings — the project store, a .klz workspace, interchange import/export — feed and drain the same Part stream without a reader or writer, so a flow is agnostic to where its content enters and leaves.

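type DataFormatReader interface {
    // Open attaches the reader to a RawDocument (raw bytes plus metadata).
    Open(ctx context.Context, doc *RawDocument) error
    // Read streams Parts until the document is exhausted or an error occurs, then closes the channel.
    Read(ctx context.Context) <-chan PartResult
    // Close releases any held resources.
    Close() error
}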

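type DataFormatWriter interface {
    // SetOutput sets the destination path.
    SetOutput(path string) error
    // Write consumes Parts until the channel closes, writing output to the destination.
    Write(ctx context.Context, in <-chan *Part) error
    // Close ends the writer lifecycle.
    Close() error
}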

The reader lifecycle is Open → Read → Close. Open attaches the reader to a RawDocument (raw bytes plus metadata such as source locale and file path). Read returns a channel of PartResult{Part, Error} — the reader produces Parts until the document is exhausted or an error occurs, then closes the channel. Close releases any held resources.

The writer lifecycle is SetOutput → Write → Close. SetOutput sets the destination path. Write consumes a channel of *Part until the channel closes, producing output on the writer's destination.
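
A caller might drive both lifecycles roughly as follows. This is a minimal sketch, not the engine's actual wiring: Part, PartResult, and RawDocument are reduced stand-ins, and Roundtrip, lineReader, collector, and demo are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"errors"
	"strings"
)

// Reduced stand-ins for the real types, which carry far more than this.
type Part struct{ Text string }

type PartResult struct {
	Part  *Part
	Error error
}

type RawDocument struct{ Data []byte }

type DataFormatReader interface {
	Open(ctx context.Context, doc *RawDocument) error
	Read(ctx context.Context) <-chan PartResult
	Close() error
}

type DataFormatWriter interface {
	SetOutput(path string) error
	Write(ctx context.Context, in <-chan *Part) error
	Close() error
}

// Roundtrip (hypothetical) runs Open → Read → Close on the reader and
// SetOutput → Write → Close on the writer, forwarding Parts between the two.
func Roundtrip(ctx context.Context, r DataFormatReader, w DataFormatWriter, doc *RawDocument, out string) (err error) {
	if err = r.Open(ctx, doc); err != nil {
		return err
	}
	defer r.Close()
	if err = w.SetOutput(out); err != nil {
		return err
	}
	defer func() { err = errors.Join(err, w.Close()) }()
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	parts := make(chan *Part)
	readErr := make(chan error, 1)
	go func() {
		defer close(parts)
		for res := range r.Read(ctx) {
			if res.Error != nil {
				readErr <- res.Error
				return
			}
			select {
			case parts <- res.Part:
			case <-ctx.Done():
				readErr <- ctx.Err()
				return
			}
		}
		readErr <- nil
	}()
	writeErr := w.Write(ctx, parts)
	cancel() // unblock the forwarder if the writer stopped early
	return errors.Join(writeErr, <-readErr)
}

// lineReader is a toy reader: one Part per line of the raw document.
type lineReader struct{ doc *RawDocument }

func (l *lineReader) Open(_ context.Context, d *RawDocument) error { l.doc = d; return nil }
func (l *lineReader) Close() error                                 { return nil }
func (l *lineReader) Read(context.Context) <-chan PartResult {
	lines := strings.Split(string(l.doc.Data), "\n")
	out := make(chan PartResult, len(lines)) // buffered, so no goroutine is needed
	for _, s := range lines {
		out <- PartResult{Part: &Part{Text: s}}
	}
	close(out)
	return out
}

// collector is a toy writer that keeps the text of every Part it receives.
type collector struct{ Lines []string }

func (c *collector) SetOutput(string) error { return nil }
func (c *collector) Close() error           { return nil }
func (c *collector) Write(_ context.Context, in <-chan *Part) error {
	for p := range in {
		c.Lines = append(c.Lines, p.Text)
	}
	return nil
}

// demo pushes text through the toy reader and writer.
func demo(text string) ([]string, error) {
	c := &collector{}
	err := Roundtrip(context.Background(), &lineReader{}, c, &RawDocument{Data: []byte(text)}, "out.txt")
	return c.Lines, err
}
```

In this sketch the forwarding goroutine is the only place where PartResult is unwrapped, so a reader error stops the stream and is reported once the writer has returned.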

BaseFormatReader and BaseFormatWriter

BaseFormatReader and BaseFormatWriter provide shared behavior that concrete formats embed:

Document-level Layer bracketing (PartLayerStart/PartLayerEnd for the root document layer)
Locale metadata propagation
Source/target locale accessors
Consistent error handling and channel lifecycle

A concrete format implements the format-specific parsing/serialization and delegates lifecycle to the base embed.
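
The division of labor between the embed and a concrete format could look like the sketch below. It assumes a heavily reduced BaseFormatReader: the Stream helper, the LineReader format, and readKinds are hypothetical, and the Part fields shown are placeholders rather than the real content model.

```go
package main

import (
	"context"
	"strings"
)

type PartKind int

const (
	PartLayerStart PartKind = iota
	PartBlock
	PartLayerEnd
)

type Part struct {
	Kind PartKind
	Text string
}

type PartResult struct {
	Part  *Part
	Error error
}

type RawDocument struct {
	Data         []byte
	SourceLocale string
}

// BaseFormatReader is a hypothetical reduction of the shared embed.
type BaseFormatReader struct{ doc *RawDocument }

func (b *BaseFormatReader) Open(_ context.Context, doc *RawDocument) error { b.doc = doc; return nil }
func (b *BaseFormatReader) Close() error                                   { b.doc = nil; return nil }
func (b *BaseFormatReader) SourceLocale() string                           { return b.doc.SourceLocale }

// Stream (hypothetical) owns the channel lifecycle: it brackets whatever
// parse emits with the root layer start and end, then closes the channel.
func (b *BaseFormatReader) Stream(ctx context.Context, parse func(emit func(*Part) bool) error) <-chan PartResult {
	out := make(chan PartResult)
	send := func(r PartResult) bool {
		select {
		case out <- r:
			return true
		case <-ctx.Done():
			return false
		}
	}
	emit := func(p *Part) bool { return send(PartResult{Part: p}) }
	go func() {
		defer close(out)
		if !emit(&Part{Kind: PartLayerStart}) {
			return
		}
		if err := parse(emit); err != nil {
			send(PartResult{Error: err})
			return
		}
		emit(&Part{Kind: PartLayerEnd})
	}()
	return out
}

// LineReader is a toy concrete format: one block per non-empty line.
type LineReader struct{ BaseFormatReader }

func (r *LineReader) Read(ctx context.Context) <-chan PartResult {
	return r.Stream(ctx, func(emit func(*Part) bool) error {
		for _, line := range strings.Split(string(r.doc.Data), "\n") {
			if line != "" && !emit(&Part{Kind: PartBlock, Text: line}) {
				return ctx.Err()
			}
		}
		return nil
	})
}

// readKinds opens a document and collects the kind of every Part produced.
func readKinds(data string) []PartKind {
	ctx := context.Background()
	r := &LineReader{}
	if err := r.Open(ctx, &RawDocument{Data: []byte(data), SourceLocale: "en"}); err != nil {
		return nil
	}
	defer r.Close()
	var kinds []PartKind
	for res := range r.Read(ctx) {
		if res.Error == nil {
			kinds = append(kinds, res.Part.Kind)
		}
	}
	return kinds
}
```

The concrete type only supplies the parse callback; Open, Close, the locale accessor, and the layer bracketing all come from the embed.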

Built-in formats

The built-in formats under core/formats/ span several families:

Markup — HTML, XML, Markdown / MDX, and structured-document formats.
Translation exchange — XLIFF 1.2 / 2.0, TMX, Gettext PO/MO.
Structured data — JSON, YAML, CSV, and design-token / app-localization variants (xcstrings, arb, i18next, resx, Android strings, iOS strings, …).
Office and publishing — OpenXML (.docx, .xlsx, .pptx), ODF, IDML, and related packaged formats.
Subtitle / media — SRT, VTT, TTML, and similar.

The full, authoritative list of registered formats — with extensions, MIME types, and per-format options — is the generated Format Reference. It is derived from the live registry, so it never drifts from the code.

Each format package under core/formats/<name>/ contains reader.go, writer.go, and config.go. Formats register both the reader factory and writer factory in core/formats/register.go via init().

FormatRegistry

A single *FormatRegistry (a concrete struct in core/registry) exposes factory lookup. Names are the FormatID string type; registration takes a factory plus static metadata, so no reader instance is built at startup:

func (r *FormatRegistry) RegisterReader(name FormatID, factory FormatReaderFactory, sig format.FormatSignature, displayName string)
func (r *FormatRegistry) RegisterWriter(name FormatID, factory FormatWriterFactory)
func (r *FormatRegistry) NewReader(name FormatID) (format.DataFormatReader, error)
func (r *FormatRegistry) NewWriter(name FormatID) (format.DataFormatWriter, error)
func (r *FormatRegistry) FormatInfos() []FormatInfo

Detection is delegated to a *format.Detector, reachable via r.Detector(). The registry's DetectByExtension(ext) (and the source-scoped DetectByExtensionForSources) wrap it, falling back to the lazy plugin-load onMiss hook on a first miss.
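
The factory lookup and the first-miss hook can be illustrated with a reduced registry. This is a sketch under assumptions: FormatSignature is cut down to extensions only, DataFormatReader to a single marker method, the (FormatID, bool) return of DetectByExtension is a guess at the shape, and the onMiss field, lookup, and stubReader are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type FormatID string

// DataFormatReader is reduced to a single method for this sketch.
type DataFormatReader interface{ Format() FormatID }

type FormatReaderFactory func() DataFormatReader

// FormatSignature is a hypothetical reduction: only extensions are modelled.
type FormatSignature struct{ Extensions []string }

type FormatRegistry struct {
	mu       sync.RWMutex
	readers  map[FormatID]FormatReaderFactory
	names    map[FormatID]string
	byExt    map[string]FormatID
	onMiss   func(*FormatRegistry) // hypothetical lazy plugin-load hook
	loadOnce sync.Once
}

func NewFormatRegistry(onMiss func(*FormatRegistry)) *FormatRegistry {
	return &FormatRegistry{
		readers: map[FormatID]FormatReaderFactory{},
		names:   map[FormatID]string{},
		byExt:   map[string]FormatID{},
		onMiss:  onMiss,
	}
}

// RegisterReader stores the factory and metadata only; no reader is built.
func (r *FormatRegistry) RegisterReader(name FormatID, factory FormatReaderFactory, sig FormatSignature, displayName string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.readers[name] = factory
	r.names[name] = displayName
	for _, ext := range sig.Extensions {
		r.byExt[strings.ToLower(ext)] = name
	}
}

// NewReader builds a fresh reader from the registered factory.
func (r *FormatRegistry) NewReader(name FormatID) (DataFormatReader, error) {
	r.mu.RLock()
	factory, ok := r.readers[name]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("format %q: no reader registered", name)
	}
	return factory(), nil
}

func (r *FormatRegistry) lookup(ext string) (FormatID, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	id, ok := r.byExt[strings.ToLower(ext)]
	return id, ok
}

// DetectByExtension consults the table and, on the first miss, runs onMiss
// once so that plugin formats can register before a second lookup.
func (r *FormatRegistry) DetectByExtension(ext string) (FormatID, bool) {
	if id, ok := r.lookup(ext); ok {
		return id, true
	}
	r.loadOnce.Do(func() {
		if r.onMiss != nil {
			r.onMiss(r)
		}
	})
	return r.lookup(ext)
}

// stubReader is a placeholder reader used only to exercise the registry.
type stubReader struct{ id FormatID }

func (s stubReader) Format() FormatID { return s.id }
```

Because only factories are stored, registering a format costs a map insert, and a reader is constructed only when NewReader is called.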

Tiered registration makes native, plugin, and bridge formats indistinguishable to callers:

Native built-ins — registered at program start via init() hooks in core/formats/register.go.
Plugin formats — registered from the formats capability declared in each plugin's manifest.json, read from disk during plugin discovery (cli/pluginhost) without launching a subprocess.
Bridge formats — served by a Mode-C daemon plugin (the Okapi bridge) over a Unix-socket gRPC connection; the host registers proxy factories that dial the daemon on demand (see AD-007: Plugin System and Okapi Bridge).

A format reference in user-facing configuration uses the syntax name[@version][:preset], e.g. okf_html@1.46.0:wellFormed. The registry resolves the reference to the appropriate factory.
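
Splitting such a reference into its parts is straightforward. The sketch below is not the registry's parser: FormatRef and ParseFormatRef are hypothetical names, and it assumes that names and versions never contain ':' and that names never contain '@'.

```go
package main

import (
	"fmt"
	"strings"
)

// FormatRef is a hypothetical holder for the three parts of a reference.
type FormatRef struct {
	Name    string
	Version string // empty when no @version is given
	Preset  string // empty when no :preset is given
}

// ParseFormatRef splits name[@version][:preset]. A separator that is present
// must be followed by a non-empty value, and the name itself is required.
func ParseFormatRef(s string) (FormatRef, error) {
	rest, preset, hasPreset := strings.Cut(s, ":")
	if hasPreset && preset == "" {
		return FormatRef{}, fmt.Errorf("format reference %q: empty preset", s)
	}
	name, version, hasVersion := strings.Cut(rest, "@")
	if hasVersion && version == "" {
		return FormatRef{}, fmt.Errorf("format reference %q: empty version", s)
	}
	if name == "" {
		return FormatRef{}, fmt.Errorf("format reference %q: missing name", s)
	}
	return FormatRef{Name: name, Version: version, Preset: preset}, nil
}
```

With these rules, okf_html@1.46.0:wellFormed yields the name okf_html, the version 1.46.0, and the preset wellFormed, while a bare json yields only a name.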

Format detection

Detector.Detect(path, reader, mimeType) returns the best-matching format name using a cascade:

1. MIME type: explicit declaration wins if present.
2. File extension: .html, .xliff, .json, etc. resolve deterministically.
3. Magic bytes: binary signatures (BOM, XML declaration, ZIP signature for OpenXML).
4. Content sniffing: heuristic analysis for formats that share extensions (e.g., distinguishing XLIFF 1.2 from XLIFF 2.0).

Each format registers a FormatMeta record that declares the MIME types and extensions it claims, so the cascade is data-driven rather than hardcoded.
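
A data-driven cascade of this kind might be sketched as follows. The FormatMeta fields, the 512-byte prefix, the (FormatID, bool) return, and the narrowing rule (a tier with several matches restricts the candidates for later tiers) are all assumptions made for illustration; detectHead, contains, and newDemoDetector are hypothetical helpers.

```go
package main

import (
	"bytes"
	"io"
	"path/filepath"
	"strings"
)

type FormatID string

// FormatMeta here is a hypothetical shape; the real record may differ.
type FormatMeta struct {
	ID         FormatID
	MIMETypes  []string
	Extensions []string               // lower-case, with leading dot
	Magic      [][]byte               // byte prefixes
	Sniff      func(head []byte) bool // optional content heuristic
}

type Detector struct{ metas []FormatMeta }

func (d *Detector) Register(m FormatMeta) { d.metas = append(d.metas, m) }

// Detect reads a small prefix of r and runs the cascade over it.
func (d *Detector) Detect(path string, r io.Reader, mimeType string) (FormatID, bool) {
	var head []byte
	if r != nil {
		head, _ = io.ReadAll(io.LimitReader(r, 512))
	}
	return d.detectHead(path, head, mimeType)
}

func contains(list []string, v string) bool {
	for _, s := range list {
		if s == v {
			return true
		}
	}
	return false
}

// detectHead walks the tiers in order. A tier with exactly one match decides;
// a tier with several matches narrows the candidates for the next tier.
func (d *Detector) detectHead(path string, head []byte, mimeType string) (FormatID, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	tiers := []func(FormatMeta) bool{
		func(m FormatMeta) bool { return mimeType != "" && contains(m.MIMETypes, mimeType) },
		func(m FormatMeta) bool { return ext != "" && contains(m.Extensions, ext) },
		func(m FormatMeta) bool {
			for _, sig := range m.Magic {
				if bytes.HasPrefix(head, sig) {
					return true
				}
			}
			return false
		},
		func(m FormatMeta) bool { return m.Sniff != nil && m.Sniff(head) },
	}
	candidates := d.metas
	for _, match := range tiers {
		var hits []FormatMeta
		for _, m := range candidates {
			if match(m) {
				hits = append(hits, m)
			}
		}
		if len(hits) == 1 {
			return hits[0].ID, true
		}
		if len(hits) > 1 {
			candidates = hits
		}
	}
	return "", false
}

// newDemoDetector registers a few illustrative formats.
func newDemoDetector() *Detector {
	has := func(s string) func([]byte) bool {
		return func(head []byte) bool { return bytes.Contains(head, []byte(s)) }
	}
	d := &Detector{}
	d.Register(FormatMeta{ID: "html", MIMETypes: []string{"text/html"}, Extensions: []string{".html"}})
	d.Register(FormatMeta{ID: "docx", Extensions: []string{".docx"}, Magic: [][]byte{[]byte("PK\x03\x04")}})
	d.Register(FormatMeta{ID: "xliff12", Extensions: []string{".xlf"}, Sniff: has(`version="1.2"`)})
	d.Register(FormatMeta{ID: "xliff20", Extensions: []string{".xlf"}, Sniff: has(`version="2.0"`)})
	return d
}
```

In this sketch a shared .xlf extension does not decide on its own; it leaves two candidates, and content sniffing picks between them. If no tier isolates a single format, detection reports no match and the caller would need an explicit format.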

Skeleton strategies

Three interchangeable strategies preserve non-translatable content for roundtrip writing. A format picks the one that fits its structure:

SkeletonStore streaming (HTML, XML). A temp-file-backed binary store. The reader writes non-translatable bytes and block references during extraction; the writer reads entries sequentially to reconstruct the document with byte-exact fidelity. Peak memory is ~100 KB per document regardless of document size. Preferred for new formats. See Skeleton Store for the binary format and wiring.
Re-parse (JSON, YAML, PO, Plaintext). The writer re-opens the source document and replaces translatable content in place. Simple, but it holds the document in memory twice during writing.
Fragment-based (XLIFF, some XML dialects). Interleaved skeleton of non-translatable markup plus references to translatable blocks, carried inline on the Data/Block resources. Suits formats whose translatable content is sparse.

All three strategies present the same DataFormatWriter interface to the pipeline.
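
The idea shared by the streaming and fragment-based strategies, a sequence of literal spans interleaved with block references that the writer replays in order, can be shown with an in-memory toy. This is not the SkeletonStore binary format: Skeleton, entry, Extract, and Render are hypothetical, and the key=value input is chosen only because it is easy to split.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is either literal text or a reference to a translatable block.
type entry struct {
	literal string
	blockID string // non-empty for block references
}

// Skeleton is an in-memory toy; the real SkeletonStore is a temp-file-backed
// binary store, which this sketch does not attempt to reproduce.
type Skeleton struct{ entries []entry }

// Extract splits a key=value document into skeleton entries and blocks.
// Keys, separators, comments, and line endings stay in the skeleton.
func Extract(src string) (*Skeleton, map[string]string) {
	sk := &Skeleton{}
	blocks := map[string]string{}
	for _, line := range strings.SplitAfter(src, "\n") {
		eq := strings.IndexByte(line, '=')
		if eq < 0 || strings.HasPrefix(line, "#") {
			sk.entries = append(sk.entries, entry{literal: line})
			continue
		}
		value := strings.TrimRight(line[eq+1:], "\r\n")
		id := fmt.Sprintf("b%d", len(blocks)+1)
		blocks[id] = value
		sk.entries = append(sk.entries,
			entry{literal: line[:eq+1]},
			entry{blockID: id},
			entry{literal: line[eq+1+len(value):]}, // the original line ending
		)
	}
	return sk, blocks
}

// Render replays the entries in order, substituting block text for references.
func (s *Skeleton) Render(blocks map[string]string) (string, error) {
	var out strings.Builder
	for _, e := range s.entries {
		if e.blockID == "" {
			out.WriteString(e.literal)
			continue
		}
		text, ok := blocks[e.blockID]
		if !ok {
			return "", fmt.Errorf("skeleton references unknown block %q", e.blockID)
		}
		out.WriteString(text)
	}
	return out.String(), nil
}
```

Rendering with unmodified blocks reproduces the input byte for byte, including mixed line endings; replacing a block's text changes only that span.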

Subfilters and nested layers

Format readers can emit child Layers when they encounter embedded content in a different format (HTML inside JSON, Markdown inside CSV). The child reader is resolved via a SubfilterResolver injected by the FormatRegistry. This mechanism is defined in AD-002: Content Model — format readers just implement SubfilterAware and declare patterns in their config.
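
How a host reader might hand a value to a child reader is sketched below. Every shape here is an assumption: the fields of SubfilterMapping, the function form of SubfilterResolver, the SubReader type, and the use of a glob over the key are illustrative only, and the real contract is the one defined in AD-002.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

type FormatID string

type PartKind int

const (
	PartLayerStart PartKind = iota
	PartBlock
	PartLayerEnd
)

type Part struct {
	Kind   PartKind
	Format FormatID // set on layer boundaries
	Text   string   // set on blocks
}

// SubfilterMapping is a hypothetical shape: a glob over the key, plus the
// format that should parse matching values.
type SubfilterMapping struct {
	Pattern string
	Format  FormatID
}

// SubReader and SubfilterResolver are hypothetical reductions of the child
// reader and of the resolver injected by the registry.
type SubReader func(content string) []Part
type SubfilterResolver func(FormatID) (SubReader, error)

// EmitValue turns one key/value pair of the host format into Parts. A value
// whose key matches a mapping is wrapped in a child layer and parsed by the
// resolved child reader; any other value becomes a single block.
func EmitValue(key, value string, subfilters []SubfilterMapping, resolve SubfilterResolver) ([]Part, error) {
	for _, m := range subfilters {
		ok, err := path.Match(m.Pattern, key)
		if err != nil {
			return nil, fmt.Errorf("subfilter pattern %q: %w", m.Pattern, err)
		}
		if !ok {
			continue
		}
		child, err := resolve(m.Format)
		if err != nil {
			return nil, err
		}
		parts := []Part{{Kind: PartLayerStart, Format: m.Format}}
		parts = append(parts, child(value)...)
		return append(parts, Part{Kind: PartLayerEnd, Format: m.Format}), nil
	}
	return []Part{{Kind: PartBlock, Text: value}}, nil
}

// paragraphs is a toy child reader: one block per blank-line-separated paragraph.
func paragraphs(content string) []Part {
	var parts []Part
	for _, p := range strings.Split(content, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			parts = append(parts, Part{Kind: PartBlock, Text: p})
		}
	}
	return parts
}

// demoResolver knows a single child format.
func demoResolver(id FormatID) (SubReader, error) {
	if id == "markdown" {
		return paragraphs, nil
	}
	return nil, fmt.Errorf("no subfilter for format %q", id)
}
```

Because the child reader returns ordinary Parts, it could itself call EmitValue for its own embedded content, which is how the recursion stays free of special cases in this sketch.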

Implementing a new format

To add a new format:

1. Create core/formats/<name>/ with reader.go, writer.go, and config.go.
2. Implement DataFormatReader by embedding BaseFormatReader and providing the format-specific parse logic.
3. Implement DataFormatWriter by embedding BaseFormatWriter and providing the format-specific serialize logic.
4. Populate every field on each inline-code run for any inline markup: ID, Type/SubType, Data, Disp, Equiv, Constraints (AD-002: Content Model).
5. Pick a skeleton strategy appropriate to the format's structure.
6. Register the reader and writer factories in core/formats/register.go via an init() call.
7. If the format can host embedded content, implement SubfilterAware and accept Subfilters []SubfilterMapping in the config.

See Implementing Formats for a walkthrough, and Skeleton Store for the preferred skeleton strategy details.
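
The writer-side steps above (embedding BaseFormatWriter and registering a factory from init()) might look like this for a toy format. It is a sketch under assumptions: BaseFormatWriter is reduced to owning the output file, the package-level writers map stands in for the shared registry, and linesWriter and demoWrite are hypothetical.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

type FormatID string

type Part struct{ Text string }

type DataFormatWriter interface {
	SetOutput(path string) error
	Write(ctx context.Context, in <-chan *Part) error
	Close() error
}

type FormatWriterFactory func() DataFormatWriter

// BaseFormatWriter is a hypothetical reduction of the shared embed: here it
// only owns the output file and its buffered writer.
type BaseFormatWriter struct {
	file *os.File
	out  *bufio.Writer
}

func (b *BaseFormatWriter) SetOutput(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	b.file, b.out = f, bufio.NewWriter(f)
	return nil
}

func (b *BaseFormatWriter) Close() error {
	if b.file == nil {
		return nil
	}
	err := errors.Join(b.out.Flush(), b.file.Close())
	b.file, b.out = nil, nil
	return err
}

// linesWriter is a toy format that serializes each Part as one line.
// Cancellation handling is omitted for brevity.
type linesWriter struct{ BaseFormatWriter }

func (w *linesWriter) Write(_ context.Context, in <-chan *Part) error {
	if w.out == nil {
		return errors.New("lines: SetOutput must be called before Write")
	}
	for p := range in {
		if _, err := fmt.Fprintln(w.out, p.Text); err != nil {
			return err
		}
	}
	return nil
}

// writers stands in for the shared registry populated from register.go.
var writers = map[FormatID]FormatWriterFactory{}

func init() {
	writers["lines"] = func() DataFormatWriter { return &linesWriter{} }
}

// demoWrite runs SetOutput → Write → Close against a temp file and returns
// the bytes that ended up on disk.
func demoWrite(lines ...string) (string, error) {
	dir, err := os.MkdirTemp("", "lines-format")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "out.txt")

	w := writers["lines"]()
	if err := w.SetOutput(path); err != nil {
		return "", err
	}
	in := make(chan *Part, len(lines))
	for _, l := range lines {
		in <- &Part{Text: l}
	}
	close(in)
	if err := errors.Join(w.Write(context.Background(), in), w.Close()); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}
```

The concrete writer contributes only the serialize loop; output handling comes from the embed, and the init() registration is the single point where the format becomes visible to callers.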

Consequences

Format readers emit the same streaming Part protocol regardless of source format, so tools never need format-specific code.
Format writers replay Run.Data verbatim via RenderRunsWithData (AD-002: Content Model), so roundtrip fidelity is inherited from the content model.
Native, plugin, and bridge formats coexist in one registry; the pipeline treats them identically.
The MIME/extension/magic/content cascade resolves most files without user configuration; ambiguous cases fall back to explicit format flags.
Three skeleton strategies cover the full span of file formats from streaming text to zip-packaged markup.
New formats plug in by adding a directory and registering in init(); no core changes needed.
SkeletonStore gives bounded memory for large markup documents, at the cost of a temp file and a binary protocol between reader and writer.

Related

AD-002: Content Model — Parts that readers produce and writers consume; the Run model that drives roundtrip fidelity
AD-004: Processing Engine — how readers and writers plug into the pipeline
AD-006: Tool System — the tools that sit between reader and writer
AD-026: Flow I/O Binding — readers/writers as the file binding; other bindings (store, .klz, interchange) feed the same stream
AD-007: Plugin System and Okapi Bridge — how plugin and bridge formats register
Implementing Formats — implementation walkthrough
Skeleton Store — binary skeleton format and wiring
