Implementing Formats

Step-by-step guide for implementing new neokapi format readers/writers or migrating existing Okapi filters. Parent AD: AD-005.

Canonical tutorial

For the end-to-end "add a format" walkthrough, follow Implementing a Format. This note focuses on the skeleton-store, writer-fallback, and Okapi-porting internals that sit beneath that tutorial. Maintainers: the maturity bar a format must clear lives in docs/internals/format-maturity.md, and the consolidated engine reference in docs/internals/format-engineering.md.

Terminology Mapping from Okapi

Okapi (Java)	neokapi (Go)
Filter	DataFormat (Reader/Writer)
Step	Tool
Pipeline	Flow
PipelineDriver	Executor
Event	Part
TextUnit	Block
TextFragment	Run sequence (`[]Run`)
Code	Run
StartDocument / EndDocument	Layer (root)
StartSubDocument / StartSubFilter	Child Layer

File Structure

Create a package under core/formats/<name>/ containing the config, reader, and writer source files, their tests, and a testdata/ directory:

core/formats/<name>/
├── config.go       # Config struct with Reset(), Validate(), ApplyMap()
├── reader.go       # DataFormatReader implementation
├── writer.go       # DataFormatWriter implementation
├── reader_test.go  # Reader tests
├── writer_test.go  # Writer or roundtrip tests
└── testdata/       # Test input files

Config

Every format has a Config struct implementing format.DataFormatConfig:

type Config struct {
    // Format-specific options...
    // Use compiled regex caches for regex-based config (see json/config.go).
}

func (c *Config) FormatName() string { return "<name>" }

func (c *Config) Reset() {
    *c = Config{
        // Set defaults here. Use zero values intentionally —
        // bool defaults to false, so use "nonFoo" naming when
        // you want the default behavior to be "foo".
    }
}

func (c *Config) Validate() error {
    // Return non-nil error for invalid combinations.
    return nil
}

// ApplyMap applies config values from a generic map (used by CLI/presets).
func (c *Config) ApplyMap(values map[string]any) error {
    for key, val := range values {
        switch key {
        case "someOption":
            // type-assert and assign
        default:
            return fmt.Errorf("<name>: unknown parameter: %s", key)
        }
    }
    return nil
}

Reference: core/formats/json/config.go (complex config with regex caches), core/formats/plaintext/config.go (minimal config).
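
The "type-assert and assign" step in ApplyMap can be sketched as follows. This is a minimal illustration, not code from the repository: the Config fields and the option keys extractAll and keyPattern are hypothetical, and the error wording is only a suggestion.

```go
package main

import "fmt"

// Config is a hypothetical config with two illustrative options.
type Config struct {
	ExtractAll bool   // hypothetical option, zero value is the default
	KeyPattern string // hypothetical option
}

// Reset restores defaults; here every default is the zero value.
func (c *Config) Reset() { *c = Config{} }

// ApplyMap checks the dynamic type of each value before assigning it,
// so a preset with a wrong type fails with a clear error instead of a panic.
func (c *Config) ApplyMap(values map[string]any) error {
	for key, val := range values {
		switch key {
		case "extractAll":
			b, ok := val.(bool)
			if !ok {
				return fmt.Errorf("demo: %s: expected bool, got %T", key, val)
			}
			c.ExtractAll = b
		case "keyPattern":
			s, ok := val.(string)
			if !ok {
				return fmt.Errorf("demo: %s: expected string, got %T", key, val)
			}
			c.KeyPattern = s
		default:
			return fmt.Errorf("demo: unknown parameter: %s", key)
		}
	}
	return nil
}

func main() {
	cfg := &Config{}
	cfg.Reset()
	err := cfg.ApplyMap(map[string]any{"extractAll": true, "keyPattern": "^msg_"})
	fmt.Println(cfg.ExtractAll, cfg.KeyPattern, err)
}
```

Because Go map iteration order is unspecified, a failing ApplyMap may already have applied some keys. Callers that need all-or-nothing behavior could apply the map to a copy and assign it back only on success.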

Reader

Embed format.BaseFormatReader and implement format.DataFormatReader:

type Reader struct {
    format.BaseFormatReader
    cfg           *Config
    skeletonStore *format.SkeletonStore
    skelBuf       bytes.Buffer // coalescing buffer for skeleton text
}

var _ format.SkeletonStoreEmitter = (*Reader)(nil)

func NewReader() *Reader {
    cfg := &Config{}
    cfg.Reset()
    return &Reader{
        BaseFormatReader: format.BaseFormatReader{
            FormatName:        "<name>",
            FormatDisplayName: "<Display Name>",
            FormatMimeType:    "application/<name>",
            FormatExtensions:  []string{".<ext>"},
            Cfg:               cfg,
        },
        cfg: cfg,
    }
}

func (r *Reader) SetSkeletonStore(store *format.SkeletonStore) {
    r.skeletonStore = store
}

BaseFormatReader supplies Name/DisplayName/Config/SetConfig. You must still implement the three methods it does not provide — Signature, Open, and Close:

func (r *Reader) Signature() format.FormatSignature {
    return format.FormatSignature{
        MIMETypes:  []string{"application/<name>"},
        Extensions: []string{".<ext>"},
    }
}

// Open validates and stashes the document; it does NOT parse. Parse errors are
// surfaced on the channel in Read (as PartResult.Error), never returned here.
func (r *Reader) Open(ctx context.Context, doc *model.RawDocument) error {
    if doc == nil || doc.Reader == nil {
        return errors.New("<name>: nil document or reader")
    }
    r.Doc = doc
    return nil
}

func (r *Reader) Close() error {
    if r.Doc != nil && r.Doc.Reader != nil {
        return r.Doc.Reader.Close()
    }
    return nil
}

Read Method Pattern

The Read method starts a goroutine that sends model.PartResult values on a channel and closes the channel when it finishes. It must emit PartLayerStart first, then blocks/data, then PartLayerEnd:

func (r *Reader) Read(ctx context.Context) <-chan model.PartResult {
    ch := make(chan model.PartResult, 64)
    go func() {
        defer close(ch)
        r.readContent(ctx, ch)
    }()
    return ch
}

func (r *Reader) readContent(ctx context.Context, ch chan<- model.PartResult) {
    // 1. Emit PartLayerStart
    layer := &model.Layer{
        ID:     "doc",
        Name:   filepath.Base(r.Doc.URI),
        Format: "<name>",
        Locale: r.Doc.SourceLocale,
    }
    ch <- model.PartResult{Part: &model.Part{
        Type:     model.PartLayerStart,
        Resource: layer,
    }}

    // 2. Parse input, emit blocks and data
    //    (see Skeleton Store Integration below)

    // 3. Flush skeleton store
    r.skelFlush()
    if r.skeletonStore != nil {
        if err := r.skeletonStore.Flush(); err != nil {
            ch <- model.PartResult{Error: fmt.Errorf("<name>: flush skeleton: %w", err)}
            return
        }
    }

    // 4. Emit PartLayerEnd
    ch <- model.PartResult{Part: &model.Part{
        Type:     model.PartLayerEnd,
        Resource: layer,
    }}
}

Block Creation

block := model.NewBlock(blockID, sourceText)
block.Name = blockName
block.Properties["<format>.keypath"] = keyPath // format-specific metadata

ch <- model.PartResult{Part: &model.Part{
    Type:     model.PartBlock,
    Resource: block,
}}

Subfilter Support

If the format can contain embedded content (e.g., HTML strings inside JSON), implement format.SubfilterAware:

var _ format.SubfilterAware = (*Reader)(nil)

func (r *Reader) SetSubfilterResolver(resolver format.SubfilterResolver) {
    r.resolver = resolver
}

When encountering embedded content, create a child layer:

subReader, err := r.resolver.ResolveReader(subFormatName)
// Open subReader with the embedded content as a RawDocument
// Emit PartLayerStart for child, forward sub-parts, emit PartLayerEnd

Writer

Embed format.BaseFormatWriter and implement format.DataFormatWriter:

type Writer struct {
    format.BaseFormatWriter
    cfg           *Config
    skeletonStore *format.SkeletonStore
}

var _ format.SkeletonStoreConsumer = (*Writer)(nil)

func NewWriter() *Writer {
    cfg := &Config{}
    cfg.Reset()
    return &Writer{
        BaseFormatWriter: format.BaseFormatWriter{FormatName: "<name>"},
        cfg:              cfg,
    }
}

func (w *Writer) SetSkeletonStore(store *format.SkeletonStore) {
    w.skeletonStore = store
}

Write Method Pattern

The writer collects all blocks from the channel, then reconstructs the document. It should support a fallback chain:

func (w *Writer) Write(ctx context.Context, parts <-chan *model.Part) error {
    blocksByID := make(map[string]*model.Block)

    // 1. Drain channel, collect blocks
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case part, ok := <-parts:
            if !ok {
                goto done
            }
            if part.Type == model.PartBlock {
                if block, ok := part.Resource.(*model.Block); ok {
                    blocksByID[block.ID] = block
                }
            }
        }
    }

done:
    // 2. Reconstruct using fallback chain
    if w.skeletonStore != nil {
        return w.writeFromSkeleton(w.skeletonStore, blocksByID)
    }
    return w.writeFromBlocks(blocksByID) // fallback
}

Skeleton Store Reconstruction

func (w *Writer) writeFromSkeleton(
    store *format.SkeletonStore,
    blocks map[string]*model.Block,
) error {
    for {
        entry, err := store.Next()
        if errors.Is(err, io.EOF) {
            break
        }
        if err != nil {
            return fmt.Errorf("<name> writer: read skeleton: %w", err)
        }
        switch entry.Type {
        case format.SkeletonText:
            if _, err := w.Output.Write(entry.Data); err != nil {
                return err
            }
        case format.SkeletonRef:
            refID := string(entry.Data)
            if block, ok := blocks[refID]; ok {
                text := w.encodeValue(block) // format-specific encoding
                if _, err := io.WriteString(w.Output, text); err != nil {
                    return err
                }
            }
        }
    }
    return nil
}

Write-side post-processing: the no-regex convention

A format writer MUST NOT regex- or byte-rewrite its already-serialized output to compensate for a modeling gap. That post-processing is brittle (it pattern-matches serialized markup), couples to emission ordering, and hides the fact that the model is missing a primitive. The unified pattern that every writer follows instead:

Skeleton-store emission. The reader stores non-translatable bytes verbatim; the writer replays them and splices only translated slots, so the writer introduces no structural divergence to "fix up" afterward.
Symmetric compare-time canonicalization. Cosmetic differences between two writers (attribute order, namespace decls, self-closing vs open/close, insignificant whitespace) are cancelled by the shared XMLCanonical normalizer (cli/parity/roundtrip/normalizers.go), applied to both got and ref. Reaching the canon tier — not byte — is the norm and is sufficient.
Structural merges as canonicalization, not write-side rewriting. "Merge adjacent equivalent elements" belongs in the normalizer (applied symmetrically to both sides), not in the writer (applied to one side via regex). idml's MergeAdjacentCSRs is the template.
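
The third point can be illustrated with a toy normalizer. This is a sketch under stated assumptions, not the idml MergeAdjacentCSRs implementation: Run here is a hypothetical two-field struct rather than the neokapi model type, and "equivalent" is reduced to an identical Style string. What it shows is the symmetric application: the merge runs on both got and ref before comparison, and neither writer is asked to change its output.

```go
package main

import "fmt"

// Run is a hypothetical styled text run used only for this illustration.
type Run struct {
	Style string
	Text  string
}

// mergeAdjacent joins consecutive runs that share a style.
// It returns a new slice and leaves the input untouched.
func mergeAdjacent(runs []Run) []Run {
	var out []Run
	for _, r := range runs {
		if n := len(out); n > 0 && out[n-1].Style == r.Style {
			out[n-1].Text += r.Text
			continue
		}
		out = append(out, r)
	}
	return out
}

// equalCanonical normalizes both sides, then compares them.
func equalCanonical(got, ref []Run) bool {
	g, r := mergeAdjacent(got), mergeAdjacent(ref)
	if len(g) != len(r) {
		return false
	}
	for i := range g {
		if g[i] != r[i] {
			return false
		}
	}
	return true
}

func main() {
	got := []Run{{"bold", "Hel"}, {"bold", "lo"}, {"plain", " world"}}
	ref := []Run{{"bold", "Hello"}, {"plain", " wor"}, {"plain", "ld"}}
	fmt.Println(equalCanonical(got, ref))
}
```

Both inputs split their runs differently, yet they compare equal after normalization, which is the situation a write-side regex would otherwise try to paper over on one side only.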

Per-value escaping of text content before its first emission (backslash / quote / newline / delimiter encoding) is not post-processing and is fine.

The one sanctioned exception is faithfully reproducing a transform that Okapi itself performs on bytes the reader captured opaquely, where no symmetric normalizer can reach. openxml's DrawingML default-run hoist (optimiseDMLBlockProperties in dml_style_optimization.go) is the current template: the WML reader captures the entire <w:drawing> payload as opaque XML and replays it verbatim, so the only place to mirror Okapi's StyleOptimisation.Default hoist of common <a:rPr> into <a:pPr><a:defRPr> is an always-on post-skeleton flush. This is reproduction, not compensation: the reference output already contains the hoist, and because the payload is opaque to the comparator it cannot be cancelled on both sides. A writer that keeps such a transform MUST document the Okapi class/method it mirrors, so a reader can tell reproduction from compensation.

The WordprocessingML side does not qualify: native is faithful and emits source <w:rPr> inline with no synthesised paragraph styles. The former Word Style Optimisation (WSO) post-pass that mimicked Okapi's compact pStyle form has been deleted; equivalence with Okapi's compact output is instead proved by an effective-rPr normalizer in the parity comparator.

Formats already converted to this convention: html (DOM setAttr instead of lang regex), the regex format (prefix/capture/suffix assembly), wiki (stored header level), and openxml (structural <w:r> envelope emission + byte-splice run merges replacing the post-serialization fuse regexes). When a structural fix is genuinely impractical, prefer a documented div-tier divergence or a tracked follow-up issue over a new write-side regex.

Skeleton Store Integration

The SkeletonStore (core/format/skeleton.go) enables byte-exact roundtrip of documents. The reader writes skeleton entries as it parses; the writer reads them to reconstruct the output. Tools in between only see blocks — they never touch the skeleton.

See Skeleton Store for binary format and API details.

Reader Side: Coalescing Buffer Pattern

Do NOT write one skeleton entry per token. Use a bytes.Buffer to accumulate non-translatable text between block references, then flush before each ref:

// skelText appends text to the coalescing buffer.
func (r *Reader) skelText(s string) {
    if r.skeletonStore != nil {
        r.skelBuf.WriteString(s)
    }
}

// skelRef flushes accumulated text, then writes a block reference.
func (r *Reader) skelRef(id string) {
    if r.skeletonStore != nil {
        if r.skelBuf.Len() > 0 {
            r.skeletonStore.WriteText(r.skelBuf.Bytes())
            r.skelBuf.Reset()
        }
        r.skeletonStore.WriteRef(id)
    }
}

// skelFlush writes any remaining buffered text.
func (r *Reader) skelFlush() {
    if r.skeletonStore != nil && r.skelBuf.Len() > 0 {
        r.skeletonStore.WriteText(r.skelBuf.Bytes())
        r.skelBuf.Reset()
    }
}

This reduces skeleton entries from ~N (one per token) to ~2B+1 (where B is the number of translatable blocks). For example, a JSON file with 50 strings produces ~101 entries instead of ~10,000.
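
The 2B+1 figure can be checked with a small in-memory model. The sketch below is illustrative only: memStore is a hypothetical stand-in for format.SkeletonStore that keeps entries in a slice, and buildSkeleton emits a flat JSON-like object with one translatable value per key.

```go
package main

import (
	"bytes"
	"fmt"
)

// entry and memStore are hypothetical stand-ins for the real skeleton store.
type entry struct {
	isRef bool
	data  string
}

type memStore struct{ entries []entry }

// WriteText copies b, because bytes.Buffer.Bytes aliases the buffer's memory.
func (s *memStore) WriteText(b []byte) {
	s.entries = append(s.entries, entry{data: string(b)})
}

func (s *memStore) WriteRef(id string) {
	s.entries = append(s.entries, entry{isRef: true, data: id})
}

// coalescer mirrors the skelText / skelRef / skelFlush trio.
type coalescer struct {
	store *memStore
	buf   bytes.Buffer
}

func (c *coalescer) text(s string) { c.buf.WriteString(s) }

func (c *coalescer) flush() {
	if c.buf.Len() > 0 {
		c.store.WriteText(c.buf.Bytes())
		c.buf.Reset()
	}
}

func (c *coalescer) ref(id string) {
	c.flush()
	c.store.WriteRef(id)
}

// buildSkeleton emits several text tokens per block but flushes only before refs.
func buildSkeleton(blocks int) *memStore {
	store := &memStore{}
	c := &coalescer{store: store}
	c.text("{")
	for i := 0; i < blocks; i++ {
		if i > 0 {
			c.text(",")
		}
		c.text(fmt.Sprintf("%q", fmt.Sprintf("k%d", i)))
		c.text(":")
		c.ref(fmt.Sprintf("b%d", i))
	}
	c.text("}")
	c.flush()
	return store
}

func main() {
	fmt.Println(len(buildSkeleton(50).entries)) // prints 101
}
```

Each block contributes one coalesced text entry and one ref, and the trailing text adds the final entry. Note that the stand-in copies the bytes it receives: a store that retained the slice returned by Bytes would see it overwritten after Reset.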

What Goes Where

Content	Skeleton	Block
Structural tokens (`{`, `}`, `[`, `]`, `,`, `:`)	Text	--
Whitespace, comments, formatting	Text	--
Non-translatable values	Text	--
Object keys	Text	--
Translatable string values	Ref (block ID)	Source text
Embedded/subfiltered content	Ref (`layer:<path>`)	Child layer

The skeleton ref replaces the entire encoded value (e.g., including JSON quotes), and the writer is responsible for re-encoding the block text in the format's encoding (e.g., JSON string escaping).
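
For a JSON-like format, the re-encoding step might look like the sketch below. It is an assumption about how an encodeValue helper could be written, not the neokapi JSON writer: it uses the standard library encoder with HTML escaping turned off, and it makes no attempt to reproduce the escape style of the original source value.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// encodeJSONString returns s as a JSON string literal, quotes included,
// which is what the writer emits in place of a skeleton ref.
func encodeJSONString(s string) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, > and & literal instead of \u003c etc.
	if err := enc.Encode(s); err != nil {
		return "", err
	}
	// Encode appends a newline after the value; drop it.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	out, err := encodeJSONString("He said \"hi\"\n<b>bold</b>")
	fmt.Println(out, err)
}
```

Since the ref covers the whole encoded value, the helper must return the surrounding quotes as well. Returning only the escaped body would drop them from the output.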

Writer Fallback Chain

Always implement a fallback for when no skeleton store is wired (e.g., when the format is used outside the flow executor):

Skeleton store — byte-exact reconstruction (preferred)
Re-parse original — re-tokenize from saved original content, substitute blocks by path (good fidelity, requires holding original in memory)
Build from blocks — reconstruct from blocks alone (lowest fidelity, always works)

The JSON writer implements all three. The HTML writer implements skeleton + re-parse. Simpler formats may only need skeleton + build-from-blocks.
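
The order of preference in the chain can be sketched as a simple dispatch. Everything here is hypothetical: writerState stands in for whatever fields a real writer holds, and the strategy names are labels for this illustration only.

```go
package main

import "fmt"

// Strategy names a reconstruction path; the names are illustrative.
type Strategy string

const (
	FromSkeleton Strategy = "skeleton"
	FromOriginal Strategy = "re-parse"
	FromBlocks   Strategy = "blocks"
)

// writerState is a hypothetical stand-in for a writer's relevant fields.
type writerState struct {
	hasSkeleton bool   // a skeleton store was wired in
	original    []byte // saved source bytes, if the writer keeps them
}

// pick returns the highest-fidelity strategy that is available.
func (w writerState) pick() Strategy {
	switch {
	case w.hasSkeleton:
		return FromSkeleton
	case len(w.original) > 0:
		return FromOriginal
	default:
		return FromBlocks
	}
}

func main() {
	fmt.Println(writerState{hasSkeleton: true}.pick())
	fmt.Println(writerState{original: []byte("{}")}.pick())
	fmt.Println(writerState{}.pick())
}
```

A format that skips the re-parse tier would simply omit the middle case.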

Registration

import <name>fmt "github.com/neokapi/neokapi/core/formats/<name>"

// In RegisterAll(reg *registry.FormatRegistry, opts ...RegisterOptions):
// RegisterReader takes (name, factory, FormatSignature, displayName).
reg.RegisterReader("<name>",
    func() format.DataFormatReader { return <name>fmt.NewReader() },
    format.FormatSignature{
        MIMETypes:  []string{"application/<name>"},
        Extensions: []string{".<ext>"},
    }, "<Display Name>")
reg.RegisterWriter("<name>", func() format.DataFormatWriter { return <name>fmt.NewWriter() })

Use an import alias if the package name collides with a Go standard library package or another import in the same file (e.g., xmlfmt, csvfmt).

Testing

Test Patterns

Use github.com/stretchr/testify (assert/require). Table-driven tests are the standard pattern. Place test data in a testdata/ subdirectory.

Roundtrip Test (byte-exact)

Read a file, pass blocks through unchanged, write output, compare:

func roundtrip(t *testing.T, input string) string {
    t.Helper()
    reader := NewReader()
    writer := NewWriter()
    // Open reader with input, drain parts, feed to writer
    // Assert output == input (byte-exact)
}

Skeleton Roundtrip Test

Same as roundtrip but with a SkeletonStore wired between reader and writer:

func roundtripWithSkeleton(t *testing.T, input string) string {
    t.Helper()
    reader := NewReader()
    writer := NewWriter()
    store, err := format.NewSkeletonStore()
    require.NoError(t, err)
    defer store.Close()
    reader.SetSkeletonStore(store)
    writer.SetSkeletonStore(store)
    // Open reader, drain parts, flush store, feed blocks to writer
    // Assert output == input (byte-exact)
}

Translation Roundtrip Test

Read, modify block targets, write, verify translated values appear:

func TestTranslation(t *testing.T) {
    // Read input
    // Set target text on blocks
    // Write with skeleton store
    // Verify output has translated values in correct positions
}

What to Test

Byte-exact roundtrip: Input == output when no translation is applied
Skeleton byte-exact roundtrip: Same, but with SkeletonStore wired
Translation roundtrip: Translated text appears at correct positions
Whitespace/formatting preservation: Indentation, trailing newlines, comments (if the format supports them)
Config variations: Each config option with representative inputs
Edge cases: Empty files, Unicode, escape sequences, nested structures
Subfilter roundtrip: Embedded content survives extraction and reconstruction

Porting Okapi Tests

When migrating an Okapi filter, port its test inventory:

Find the Okapi filter's test class (e.g., JSONFilterTest.java)
Copy test input files to testdata/
Create table-driven tests mapping to each Okapi test case
Convert Java assertions to Go assert/require calls
The Okapi gold files (.gold suffix) become expected outputs

Okapi test patterns map to neokapi as:

Okapi Pattern	neokapi Equivalent
`testRoundTrip(input)`	`roundtrip(t, input)` / `roundtripWithSkeleton(t, input)`
`testExtraction(input, events)`	Read + assert block count, text, properties
`testOutput(input, gold)`	Read + write + compare against expected output
`testDoubleExtraction(input)`	Read, write, read again, compare blocks
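
The double-extraction row can be illustrated on a toy format. The sketch below does not use the neokapi reader or writer: read and write are hypothetical functions for a key=value line format, and Block is a local stand-in. It shows only the shape of the check, which is read, write, read again, then compare the two block lists.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Block is a local stand-in for model.Block.
type Block struct{ ID, Text string }

// read extracts one block per "key=value" line and skips everything else.
func read(input string) []Block {
	var blocks []Block
	for _, line := range strings.Split(input, "\n") {
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		blocks = append(blocks, Block{ID: key, Text: val})
	}
	return blocks
}

// write serializes blocks back to the toy format.
func write(blocks []Block) string {
	var sb strings.Builder
	for _, b := range blocks {
		sb.WriteString(b.ID + "=" + b.Text + "\n")
	}
	return sb.String()
}

// doubleExtract reports whether a second extraction yields the same blocks.
func doubleExtract(input string) bool {
	first := read(input)
	second := read(write(first))
	return reflect.DeepEqual(first, second)
}

func main() {
	fmt.Println(doubleExtract("greeting=Hello\n# comment\nfarewell=Bye"))
}
```

The comment line is lost by this toy writer, yet the check still passes, because double extraction compares blocks rather than bytes. Byte-level fidelity is what the roundtrip tests cover.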

Reference Implementations

Format	Best for learning	Key patterns
JSON (`core/formats/json/`)	Key-value formats, regex-based config, subfilter support	Token walking, coalescing skeleton, 3-mode writer fallback, extensive config
HTML (`core/formats/html/`)	Markup/streaming formats, tokenizer-based parsing	Tokenizer dispatch, inline spans, per-block skeletons (`model.Block.Skeleton`)
Plaintext (`core/formats/plaintext/`)	Minimal format, starting point	Simplest possible reader/writer
XLIFF (`core/formats/xliff/`)	Bilingual exchange formats	SkeletonStore (coalescing buffer in reader, `writeFromSkeleton` in writer), segment/target handling
Properties (`core/formats/properties/`)	Line-oriented key-value formats	Line parsing, escape handling

Checklist

Before submitting a new format:
