AD-005: Format System
Summary
Formats are pluggable readers and writers that convert between on-disk
representations and the Part stream. The framework ships a broad set of
built-in formats under core/formats/, each implementing DataFormatReader
and DataFormatWriter on top of shared BaseFormatReader /
BaseFormatWriter embeds. A single FormatRegistry exposes a factory-based
lookup that
serves native Go formats, plugin formats, and Okapi-bridge formats
uniformly. Format detection cascades through MIME type, extension, magic
bytes, and content sniffing. Roundtrip fidelity is supported by three
interchangeable skeleton strategies.
Context
A localization framework must read a large variety of file formats and write them back with byte-exact fidelity — every newline, every entity reference, every attribute quote style. Formats vary widely in structure: linear text (plain text, Markdown), tree-structured markup (HTML, XML, DOCX), line-oriented key-value (Java properties, iOS strings), grid-based (CSV, XLSX), and translation-specific (XLIFF, TMX, TBX, Gettext).
At the same time, formats frequently contain embedded content in other formats (HTML inside JSON, Markdown inside CSV), and the reader/writer contract must accommodate this recursion without special cases.
Decision
Reader and writer interfaces
These interfaces implement the file source and sink binding in
AD-026: Flow I/O Binding. Other bindings — the project
store, a .klz workspace, interchange import/export — feed and drain the same
Part stream without a reader or writer, so a flow is agnostic to where its
content enters and leaves.
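type DataFormatReader interface {
	// Open attaches the reader to a RawDocument.
	Open(ctx context.Context, doc *RawDocument) error
	// Read streams Parts; the channel is closed at end of document or on error.
	Read(ctx context.Context) <-chan PartResult
	// Close releases any held resources.
	Close() error
}

type DataFormatWriter interface {
	// SetOutput sets the destination path.
	SetOutput(path string) error
	// Write consumes Parts until the input channel closes.
	Write(ctx context.Context, in <-chan *Part) error
	Close() error
}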
The reader lifecycle is Open → Read → Close. Open attaches the reader
to a RawDocument (raw bytes plus metadata such as source locale and file
path). Read returns a channel of PartResult{Part, Error} — the reader
produces Parts until the document is exhausted or an error occurs, then
closes the channel. Close releases any held resources.
The writer lifecycle is SetOutput → Write → Close. SetOutput sets the
destination path. Write consumes a channel of *Part until the channel
closes, producing output on the writer's destination.
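The reader lifecycle above can be sketched from the caller's side. This is a minimal illustration, not the framework's code: Part, PartResult, and RawDocument are reduced stand-ins, and lineReader, drain, and readLines are hypothetical names introduced only for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Part, PartResult, and RawDocument are reduced stand-ins for the framework types.
type Part struct{ Text string }

type PartResult struct {
	Part  *Part
	Error error
}

type RawDocument struct{ Lines []string }

type DataFormatReader interface {
	Open(ctx context.Context, doc *RawDocument) error
	Read(ctx context.Context) <-chan PartResult
	Close() error
}

// lineReader is a hypothetical reader that emits one Part per input line.
type lineReader struct{ doc *RawDocument }

func (r *lineReader) Open(_ context.Context, doc *RawDocument) error {
	if doc == nil {
		return errors.New("lineReader: nil document")
	}
	r.doc = doc
	return nil
}

func (r *lineReader) Read(ctx context.Context) <-chan PartResult {
	out := make(chan PartResult)
	lines := r.doc.Lines
	go func() {
		defer close(out) // a closed channel tells the consumer the document is exhausted
		for _, line := range lines {
			select {
			case out <- PartResult{Part: &Part{Text: line}}:
			case <-ctx.Done():
				return // the consumer gave up; stop producing
			}
		}
	}()
	return out
}

func (r *lineReader) Close() error { return nil }

// drain runs Open, Read, and Close in order and stops at the first error result.
func drain(ctx context.Context, r DataFormatReader, doc *RawDocument) ([]string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // unblocks the producer goroutine if we return early
	if err := r.Open(ctx, doc); err != nil {
		return nil, err
	}
	defer r.Close()
	var texts []string
	for res := range r.Read(ctx) {
		if res.Error != nil {
			return texts, res.Error
		}
		texts = append(texts, res.Part.Text)
	}
	return texts, nil
}

// readLines is a small convenience wrapper used for the demo.
func readLines(lines ...string) ([]string, error) {
	return drain(context.Background(), &lineReader{}, &RawDocument{Lines: lines})
}

func main() {
	texts, err := readLines("Hello", "World")
	fmt.Println(texts, err)
}
```

Cancelling the context on return is one way to keep the producer goroutine from blocking forever when the consumer stops early; whether the real readers rely on context cancellation for this is an assumption.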
BaseFormatReader and BaseFormatWriter
BaseFormatReader and BaseFormatWriter provide shared behavior that
concrete formats embed:
- Document-level Layer bracketing (PartLayerStart/PartLayerEnd for the root document layer)
- Locale metadata propagation
- Source/target locale accessors
- Consistent error handling and channel lifecycle
A concrete format implements the format-specific parsing/serialization and delegates lifecycle to the base embed.
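As a rough sketch of the embedding pattern, the example below shows a reduced base struct supplying locale accessors and root-layer bracketing while the concrete reader contributes only parsing. The fields and the Bracket, SetSourceLocale, and Parse methods are assumptions made for illustration; only the PartLayerStart/PartLayerEnd names come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

type PartKind int

const (
	PartLayerStart PartKind = iota
	PartBlock
	PartLayerEnd
)

type Part struct {
	Kind PartKind
	Text string
}

// BaseFormatReader is a reduced, hypothetical version of the shared embed.
type BaseFormatReader struct {
	sourceLocale string
}

func (b *BaseFormatReader) SetSourceLocale(locale string) { b.sourceLocale = locale }
func (b *BaseFormatReader) SourceLocale() string          { return b.sourceLocale }

// Bracket wraps format-specific parts in the root layer start/end markers.
func (b *BaseFormatReader) Bracket(body []Part) []Part {
	out := make([]Part, 0, len(body)+2)
	out = append(out, Part{Kind: PartLayerStart})
	out = append(out, body...)
	return append(out, Part{Kind: PartLayerEnd})
}

// plainTextReader embeds the base and supplies only the parsing step.
type plainTextReader struct {
	BaseFormatReader
}

// Parse emits one block per non-empty line, bracketed by the base embed.
func (r *plainTextReader) Parse(content string) []Part {
	var body []Part
	for _, line := range strings.Split(content, "\n") {
		if line != "" {
			body = append(body, Part{Kind: PartBlock, Text: line})
		}
	}
	return r.Bracket(body)
}

func main() {
	r := &plainTextReader{}
	r.SetSourceLocale("en-US") // promoted from the embedded base
	for _, p := range r.Parse("Hello\n\nWorld") {
		fmt.Println(p.Kind, p.Text)
	}
}
```

Because Go promotes the embedded methods, the concrete reader gets the shared behavior without forwarding code.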
Built-in formats
The built-in formats under core/formats/ span several families:
- Markup — HTML, XML, Markdown / MDX, and structured-document formats.
- Translation exchange — XLIFF 1.2 / 2.0, TMX, Gettext PO/MO.
- Structured data — JSON, YAML, CSV, and design-token / app-localization variants (xcstrings, arb, i18next, resx, Android strings, iOS strings, …).
- Office and publishing — OpenXML (.docx, .xlsx, .pptx), ODF, IDML, and related packaged formats.
- Subtitle / media — SRT, VTT, TTML, and similar.
The full, authoritative list of registered formats — with extensions, MIME types, and per-format options — is the generated Format Reference. It is derived from the live registry, so it never drifts from the code.
Each format package under core/formats/<name>/ contains reader.go,
writer.go, and config.go. Formats register both the reader factory
and writer factory in core/formats/register.go via init().
FormatRegistry
A single *FormatRegistry (a concrete struct in core/registry) exposes
factory lookup. Names are the FormatID string type; registration takes a
factory plus static metadata, so no reader instance is built at startup:
func (r *FormatRegistry) RegisterReader(name FormatID, factory FormatReaderFactory, sig format.FormatSignature, displayName string)
func (r *FormatRegistry) RegisterWriter(name FormatID, factory FormatWriterFactory)
func (r *FormatRegistry) NewReader(name FormatID) (format.DataFormatReader, error)
func (r *FormatRegistry) NewWriter(name FormatID) (format.DataFormatWriter, error)
func (r *FormatRegistry) FormatInfos() []FormatInfo
Detection is delegated to a *format.Detector, reachable via r.Detector().
The registry's DetectByExtension(ext) (and the source-scoped
DetectByExtensionForSources) wrap it, falling back to the lazy plugin-load
onMiss hook on a first miss.
Tiered registration makes native, plugin, and bridge formats indistinguishable to callers:
- Native built-ins — registered at program start via init() hooks in core/formats/register.go.
- Plugin formats — registered from the formats capability declared in each plugin's manifest.json, read from disk during plugin discovery (cli/pluginhost) without launching a subprocess.
- Bridge formats — served by a Mode-C daemon plugin (the Okapi bridge) over a Unix-socket gRPC connection; the host registers proxy factories that dial the daemon on demand (see AD-007: Plugin System and Okapi Bridge).
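A factory-based registry of this kind might look roughly like the sketch below. It is deliberately simplified: the Reader interface stands in for format.DataFormatReader, RegisterReader omits the signature and display-name arguments, and namedReader is a hypothetical type. The point it illustrates is that a native factory and a bridge proxy factory are registered and looked up in exactly the same way, and that nothing is constructed until NewReader is called.

```go
package main

import (
	"fmt"
	"sync"
)

type FormatID string

// Reader is a stand-in for format.DataFormatReader.
type Reader interface{ Name() string }

type FormatReaderFactory func() (Reader, error)

// FormatRegistry here keeps only the reader factories (simplified).
type FormatRegistry struct {
	mu      sync.RWMutex
	readers map[FormatID]FormatReaderFactory
}

func NewFormatRegistry() *FormatRegistry {
	return &FormatRegistry{readers: make(map[FormatID]FormatReaderFactory)}
}

// RegisterReader stores the factory; no reader is built at registration time.
func (r *FormatRegistry) RegisterReader(name FormatID, factory FormatReaderFactory) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.readers[name] = factory
}

// NewReader invokes the factory registered under name.
func (r *FormatRegistry) NewReader(name FormatID) (Reader, error) {
	r.mu.RLock()
	factory, ok := r.readers[name]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown format %q", name)
	}
	return factory()
}

// namedReader is a hypothetical reader used only to label where it came from.
type namedReader struct{ name string }

func (n namedReader) Name() string { return n.name }

func main() {
	reg := NewFormatRegistry()
	reg.RegisterReader("html", func() (Reader, error) { return namedReader{name: "native html"}, nil })
	reg.RegisterReader("okf_html", func() (Reader, error) { return namedReader{name: "bridge proxy"}, nil })
	for _, id := range []FormatID{"html", "okf_html", "missing"} {
		rd, err := reg.NewReader(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(id, "->", rd.Name())
	}
}
```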
A format reference in user-facing configuration uses the syntax
name[@version][:preset], e.g. okf_html@1.46.0:wellFormed. The registry
resolves the reference to the appropriate factory.
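To make the reference syntax concrete, here is one possible parser for name[@version][:preset]. FormatRef and ParseFormatRef are hypothetical names, and the validation rules (a non-empty name, no empty version or preset after its separator) are assumptions rather than documented registry behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// FormatRef is a hypothetical parsed form of name[@version][:preset].
type FormatRef struct {
	Name, Version, Preset string
}

// ParseFormatRef splits off the optional preset first, then the optional version.
func ParseFormatRef(s string) (FormatRef, error) {
	head, preset, hasPreset := strings.Cut(s, ":")
	name, version, hasVersion := strings.Cut(head, "@")
	switch {
	case name == "":
		return FormatRef{}, fmt.Errorf("format reference %q: missing name", s)
	case hasVersion && version == "":
		return FormatRef{}, fmt.Errorf("format reference %q: empty version after '@'", s)
	case hasPreset && preset == "":
		return FormatRef{}, fmt.Errorf("format reference %q: empty preset after ':'", s)
	}
	return FormatRef{Name: name, Version: version, Preset: preset}, nil
}

func main() {
	for _, s := range []string{"okf_html@1.46.0:wellFormed", "json", "json:", "@1.0"} {
		ref, err := ParseFormatRef(s)
		fmt.Printf("%q -> %+v, err=%v\n", s, ref, err)
	}
}
```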
Format detection
Detector.Detect(path, reader, mimeType) returns the best-matching format
name using a cascade:
- MIME type — explicit declaration wins if present.
- File extension — .html, .xliff, .json, etc. resolve deterministically.
- Magic bytes — binary signatures (BOM, XML declaration, ZIP signature for OpenXML).
- Content sniffing — heuristic analysis for formats that share extensions (e.g., distinguishing XLIFF 1.2 from XLIFF 2.0).
Each format registers a FormatMeta record that declares the MIME types
and extensions it claims, so the cascade is data-driven rather than
hardcoded.
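A data-driven cascade could be sketched as follows. The FormatMeta fields, the format names, and the tie-breaking rule are all assumptions: here each tier either picks a unique match, has no opinion, or narrows the candidate set so that a later tier (such as content sniffing for two formats that share an extension) can break the tie. The real Detector takes a reader; this sketch takes the first bytes of the file instead.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
)

// FormatMeta fields here are assumptions made for the sketch.
type FormatMeta struct {
	Name       string
	MIMETypes  []string
	Extensions []string
	Magic      [][]byte
	Sniff      func(head []byte) bool // optional content heuristic
}

type Detector struct{ metas []FormatMeta }

func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

func hasMagic(head []byte, magics [][]byte) bool {
	for _, m := range magics {
		if bytes.HasPrefix(head, m) {
			return true
		}
	}
	return false
}

// Detect runs the tiers in order: MIME type, extension, magic bytes, sniffing.
func (d *Detector) Detect(path string, head []byte, mimeType string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	tiers := []func(FormatMeta) bool{
		func(m FormatMeta) bool { return mimeType != "" && contains(m.MIMETypes, mimeType) },
		func(m FormatMeta) bool { return ext != "" && contains(m.Extensions, ext) },
		func(m FormatMeta) bool { return hasMagic(head, m.Magic) },
		func(m FormatMeta) bool { return m.Sniff != nil && m.Sniff(head) },
	}
	candidates := d.metas
	for _, match := range tiers {
		var hits []FormatMeta
		for _, m := range candidates {
			if match(m) {
				hits = append(hits, m)
			}
		}
		switch len(hits) {
		case 1:
			return hits[0].Name, true
		case 0:
			// This tier has no opinion; keep the current candidates.
		default:
			candidates = hits // ambiguous: let later tiers break the tie
		}
	}
	return "", false // unresolved: the caller needs an explicit format
}

// versionSniffer builds a heuristic that looks for a version attribute.
func versionSniffer(version string) func([]byte) bool {
	marker := []byte(`version="` + version + `"`)
	return func(head []byte) bool { return bytes.Contains(head, marker) }
}

// demoDetector registers a few illustrative formats (names are hypothetical).
func demoDetector() *Detector {
	return &Detector{metas: []FormatMeta{
		{Name: "html", MIMETypes: []string{"text/html"}, Extensions: []string{".html"}},
		{Name: "openxml", Magic: [][]byte{[]byte("PK\x03\x04")}},
		{Name: "xliff", Extensions: []string{".xlf"}, Sniff: versionSniffer("1.2")},
		{Name: "xliff2", Extensions: []string{".xlf"}, Sniff: versionSniffer("2.0")},
	}}
}

func main() {
	name, ok := demoDetector().Detect("messages.xlf", []byte(`<xliff version="2.0">`), "")
	fmt.Println(name, ok)
}
```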
Skeleton strategies
Three interchangeable strategies preserve non-translatable content for roundtrip writing. A format picks the one that fits its structure:
- SkeletonStore streaming (HTML, XML). A temp-file-backed binary store. The reader writes non-translatable bytes and block references during extraction; the writer reads entries sequentially to reconstruct the document with byte-exact fidelity. Peak memory is ~100 KB per document regardless of document size. Preferred for new formats. See Skeleton Store for the binary format and wiring.
- Re-parse (JSON, YAML, PO, Plaintext). The writer re-opens the source document and replaces translatable content in place. Simple but holds the document in memory twice during writing.
- Fragment-based (XLIFF, some XML dialects). Interleaved skeleton of non-translatable markup plus references to translatable blocks, carried inline on the Data/Block resources. Suits formats whose translatable content is sparse.
All three strategies present the same DataFormatWriter interface to the
pipeline.
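The shared idea behind the skeleton approaches, literal bytes interleaved with references to translatable blocks, can be illustrated with a small in-memory sketch. SkeletonEntry and Render are hypothetical; this is not the SkeletonStore binary format or the framework's writer, only a demonstration of how replaying literals verbatim preserves quote style and line endings.

```go
package main

import (
	"fmt"
	"strings"
)

// SkeletonEntry is a hypothetical in-memory entry, not the SkeletonStore format.
type SkeletonEntry struct {
	Literal string // non-translatable bytes, replayed verbatim
	BlockID string // when set, the entry refers to a translatable block
}

// Render walks entries in order, copying literals and substituting block text.
func Render(skeleton []SkeletonEntry, blocks map[string]string) (string, error) {
	var sb strings.Builder
	for _, e := range skeleton {
		if e.BlockID == "" {
			sb.WriteString(e.Literal)
			continue
		}
		text, ok := blocks[e.BlockID]
		if !ok {
			return "", fmt.Errorf("skeleton references unknown block %q", e.BlockID)
		}
		sb.WriteString(text)
	}
	return sb.String(), nil
}

func main() {
	skeleton := []SkeletonEntry{
		{Literal: "<p class='intro'>"},
		{BlockID: "b1"},
		{Literal: "</p>\r\n"},
	}
	out, err := Render(skeleton, map[string]string{"b1": "Bonjour"})
	fmt.Printf("%q %v\n", out, err)
}
```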
Subfilters and nested layers
Format readers can emit child Layers when they encounter embedded content
in a different format (HTML inside JSON, Markdown inside CSV). The child
reader is resolved via a SubfilterResolver injected by the
FormatRegistry. This mechanism is defined in
AD-002: Content Model — format readers just
implement SubfilterAware and declare patterns in their config.
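As an illustration of pattern-driven subfilter selection, the sketch below maps key paths to child formats. The source names SubfilterMapping but does not show its fields, so KeyPattern, Format, the glob matching via path.Match, and resolveSubfilter are all assumptions.

```go
package main

import (
	"fmt"
	"path"
)

// SubfilterMapping fields are assumptions; the source only names the type.
type SubfilterMapping struct {
	KeyPattern string // glob matched against a key path
	Format     string // format used for the child layer
}

// resolveSubfilter returns the child format for the first matching pattern.
func resolveSubfilter(mappings []SubfilterMapping, key string) (string, bool) {
	for _, m := range mappings {
		if ok, err := path.Match(m.KeyPattern, key); err == nil && ok {
			return m.Format, true
		}
	}
	return "", false
}

func main() {
	mappings := []SubfilterMapping{
		{KeyPattern: "*.body_html", Format: "html"},
		{KeyPattern: "notes.*", Format: "markdown"},
	}
	for _, key := range []string{"article.body_html", "notes.summary", "article.title"} {
		format, ok := resolveSubfilter(mappings, key)
		fmt.Println(key, format, ok)
	}
}
```

A value whose key matches no pattern stays in the parent layer as ordinary content.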
Implementing a new format
To add a new format:
- Create core/formats/<name>/ with reader.go, writer.go, and config.go.
- Implement DataFormatReader by embedding BaseFormatReader and providing the format-specific parse logic.
- Implement DataFormatWriter by embedding BaseFormatWriter and providing the format-specific serialize logic.
- Populate every field on each inline-code run for any inline markup — ID, Type/SubType, Data, Disp, Equiv, Constraints (AD-002: Content Model).
- Pick a skeleton strategy appropriate to the format's structure.
- Register the reader and writer factories in core/formats/register.go via an init() call.
- If the format can host embedded content, implement SubfilterAware and accept Subfilters []SubfilterMapping in the config.
See Implementing Formats for a walkthrough, and Skeleton Store for the preferred skeleton strategy details.
Consequences
- Format readers emit the same streaming Part protocol regardless of source format, so tools never need format-specific code.
- Format writers replay Run.Data verbatim via RenderRunsWithData (AD-002: Content Model), so roundtrip fidelity is inherited from the content model.
- Native, plugin, and bridge formats coexist in one registry; the pipeline treats them identically.
- MIME/extension/magic/content cascade resolves most files without user configuration; ambiguous cases fall back to explicit format flags.
- Three skeleton strategies cover the full span of file formats from streaming text to zip-packaged markup.
- New formats plug in by adding a directory and registering in init(); no core changes needed.
- SkeletonStore gives bounded memory for large markup documents, at the cost of a temp file and a binary protocol between reader and writer.
Related
- AD-002: Content Model — Parts that readers produce and writers consume; the Run model that drives roundtrip fidelity
- AD-004: Processing Engine — how readers and writers plug into the pipeline
- AD-006: Tool System — the tools that sit between reader and writer
- AD-026: Flow I/O Binding — readers/writers as the file binding; other bindings (store, .klz, interchange) feed the same stream
- AD-007: Plugin System and Okapi Bridge — how plugin and bridge formats register
- Implementing Formats — implementation walkthrough
- Skeleton Store — binary skeleton format and wiring