Skip to main content

Implementing a New Format

This guide explains how to add a new document format to neokapi, from basic readers and writers to full inline code support with roundtrip fidelity.

Structure

Create a package under core/formats/ with these files:

core/formats/myformat/
├── reader.go # DataFormatReader implementation
├── writer.go # DataFormatWriter implementation
├── config.go # Format-specific configuration
├── reader_test.go # Extraction and roundtrip tests
└── testdata/ # Sample files for testing

Reader

The reader must implement format.DataFormatReader. Embed format.BaseFormatReader for shared behavior:

package myformat

import (
"context"
"github.com/neokapi/neokapi/core/format"
"github.com/neokapi/neokapi/core/model"
)

type Reader struct {
format.BaseFormatReader
}

func NewReader() *Reader {
return &Reader{
BaseFormatReader: format.BaseFormatReader{
FormatName: "myformat",
FormatDisplayName: "My Format",
FormatMimeType: "application/x-myformat",
FormatExtensions: []string{".myf"},
},
}
}

func (r *Reader) Signature() format.FormatSignature {
return format.FormatSignature{
MIMETypes: []string{"application/x-myformat"},
Extensions: []string{".myf"},
}
}

func (r *Reader) Open(ctx context.Context, doc *model.RawDocument) error {
if doc == nil || doc.Reader == nil {
return fmt.Errorf("myformat: nil document or reader")
}
r.Doc = doc
return nil
}

func (r *Reader) Read(ctx context.Context) <-chan model.PartResult {
ch := make(chan model.PartResult, 64)
go func() {
defer close(ch)

// 1. Emit PartLayerStart
ch <- model.PartResult{Part: &model.Part{
Type: model.PartLayerStart,
Resource: &model.Layer{ID: "doc1", Format: "myformat"},
}}

// 2. Emit Blocks for translatable content
ch <- model.PartResult{Part: &model.Part{
Type: model.PartBlock,
Resource: model.NewBlock("b1", "Hello"),
}}

// 3. Emit PartLayerEnd
ch <- model.PartResult{Part: &model.Part{
Type: model.PartLayerEnd,
Resource: &model.Layer{ID: "doc1", Format: "myformat"},
}}
}()
return ch
}

func (r *Reader) Close() error {
if r.Doc != nil && r.Doc.Reader != nil {
return r.Doc.Reader.Close()
}
return nil
}

The example above emits plain text. Most real-world formats contain inline markup (bold, links, images) that must be preserved through the pipeline — see Inline Code Handling below.

Writer

The writer must implement format.DataFormatWriter. Embed format.BaseFormatWriter:

type Writer struct {
format.BaseFormatWriter
}

func NewWriter() *Writer {
return &Writer{
BaseFormatWriter: format.BaseFormatWriter{FormatName: "myformat"},
}
}

func (w *Writer) Write(ctx context.Context, parts <-chan *model.Part) error {
for {
select {
case <-ctx.Done():
return ctx.Err()
case part, ok := <-parts:
if !ok {
return nil
}
switch part.Type {
case model.PartBlock:
block := part.Resource.(*model.Block)
// Write translated content — see renderRuns below
case model.PartData:
// Write structural content verbatim
}
}
}
}

Inline Code Handling

Most document formats contain inline markup — bold, italic, links, images, line breaks, variables, placeholders. A localization framework must preserve this markup through the entire pipeline (extraction, TM lookup, MT, AI translation, QA, reconstruction) without corruption.

neokapi solves this with the Run model: a block's content is a flat []model.Run sequence. Text travels as TextRuns; inline markup becomes inline-code runs (PcOpen/PcClose for paired tags, Ph for self-closing tokens) that carry the original markup in a Data field. This lets tools, translation engines, and TM matching project the runs to plain text, and the writer reconstructs the original markup by re-emitting each run's Data.

The Run Model

A Run is a discriminated union — exactly one of its pointer fields is set:

type Run struct {
Text *TextRun // plain text chunk
Ph *PlaceholderRun // self-closing: variable, icon, <br>, redaction
PcOpen *PcOpenRun // opening half of a paired code (<a>, <b>, …)
PcClose *PcCloseRun // closing half of a paired code (</a>, </b>, …)
Sub *SubRun // reference to a nested Block (subfilter output)
Plural *PluralRun // ICU plural with per-form Runs
Select *SelectRun // ICU select with per-case Runs
}

The three inline-code runs you reach for most are:

// PlaceholderRun — a self-closing token (<br/>, {count}, an icon).
type PlaceholderRun struct {
ID string // unique within the run sequence
Type string // semantic type (e.g., "fmt:linebreak", "var")
SubType string // optional refinement
Data string // original markup verbatim (e.g., "<br/>")
Equiv string // plain-text equivalent (e.g., "\n")
Disp string // editor display label (e.g., "[BR]")
Constraints *RunConstraints // deletable / cloneable / reorderable
}

// PcOpenRun — the opening half of a paired code. PcCloseRun mirrors it
// (sharing ID) but omits Disp and Constraints — the close inherits the
// opener's behavior.
type PcOpenRun struct {
ID string
Type string // e.g., "fmt:bold", "fmt:link"
SubType string
Data string // e.g., "<b>", "<a href=\"/help\">"
Equiv string
Disp string
Constraints *RunConstraints
}

RunConstraints is the editing-policy triple:

type RunConstraints struct {
Deletable bool // translator may remove this code
Cloneable bool // translator may duplicate this code
Reorderable bool // this code may move relative to others
}

How It Works

Consider this HTML paragraph:

<p>Click <b>here</b> for <a href="/help">info</a></p>

The reader extracts the <p> content as a single segment whose Runs are:

[
{Text: "Click "},
{PcOpen: {ID: "1", Type: "fmt:bold", Data: "<b>"}},
{Text: "here"},
{PcClose: {ID: "1", Type: "fmt:bold", Data: "</b>"}},
{Text: " for "},
{PcOpen: {ID: "2", Type: "fmt:link", Data: "<a href=\"/help\">"}},
{Text: "info"},
{PcClose: {ID: "2", Type: "fmt:link", Data: "</a>"}},
]

The runs are ordered, and a PcClose shares its ID with the matching PcOpen. This means:

  • block.SourceText() returns "Click here for info" (inline-code runs contribute nothing)
  • block.SourceRuns() contains the PcOpen/PcClose pairs above
  • the second run's PcOpen.Data is "<b>" (the original markup, including attributes)

Tools project the runs to plain text and skip the inline codes. Translation engines get clean text with opaque tokens. The writer re-emits each run's Data to reconstruct the original markup perfectly — even preserving attributes like class="emphasis" or href="/help".

Three Categories of Inline Elements

When implementing a format reader, classify each inline element into one of three categories:

CategoryRun kindExamplesPattern
Paired tagsPcOpen + PcClose<b>...</b>, **...**, <a>...</a>Wrap content with two runs (shared ID)
Self-closingPh<br/>, <img>, <hr/>Single run, no children
Block-level(not a run)<p>, <div>, <h1>Boundary for a new Block

The reader decides what is inline vs. block-level. For HTML, this distinction is well-defined. For other formats (Markdown, XLIFF, custom XML), you choose the mapping based on what a translator needs to see as a contiguous unit.

Complete Reader Example with Inline Codes

Here is how a reader collects inline content from a block-level element into a []model.Run slice. This pattern applies to any format with inline markup:

// collectInlineContent builds a run sequence from all text and inline
// elements inside a block-level container node.
func (r *Reader) collectInlineContent(n *html.Node) []model.Run {
var runs []model.Run
r.collectFromNode(n, &runs)
return runs
}

// appendText coalesces adjacent text so consecutive chunks stay one TextRun.
func appendText(runs *[]model.Run, text string) {
if text == "" {
return
}
if n := len(*runs); n > 0 && (*runs)[n-1].Text != nil {
(*runs)[n-1].Text.Text += text
return
}
*runs = append(*runs, model.Run{Text: &model.TextRun{Text: text}})
}

func (r *Reader) collectFromNode(n *html.Node, runs *[]model.Run) {
for child := n.FirstChild; child != nil; child = child.NextSibling {
switch child.Type {
case html.TextNode:
// Plain text — coalesce into the run sequence
appendText(runs, child.Data)

case html.ElementNode:
if selfClosingElements[child.DataAtom] {
// Self-closing: <br/>, <img>, etc. → a Ph run
*runs = append(*runs, model.Run{Ph: &model.PlaceholderRun{
ID: r.nextID(),
Type: child.Data,
Data: renderTag(child), // e.g., "<br/>"
}})
} else if isInlineElement(child) {
// Paired inline: <b>, <a>, <em>, etc.
id := r.nextID()
*runs = append(*runs, model.Run{PcOpen: &model.PcOpenRun{
ID: id,
Type: child.Data,
Data: renderOpenTag(child), // e.g., "<a href=\"/help\">"
}})
r.collectFromNode(child, runs) // Recurse into children
*runs = append(*runs, model.Run{PcClose: &model.PcCloseRun{
ID: id, // shares ID with its PcOpen
Type: child.Data,
Data: fmt.Sprintf("</%s>", child.Data),
}})
}
// Block-level elements are NOT collected — they form new Blocks
}
}
}

The key insight: recurse into inline children to handle nested formatting like <b><i>bold italic</i></b>. Each level of nesting appends its own PcOpen/PcClose pair, and the flat run sequence naturally captures the correct order. Attach the collected runs to a block with block.SetSourceRuns(runs), or build the block directly with model.NewRunsBlock(id, runs).

Reconstructing Markup in a Writer

The writer walks the run sequence and emits each run's content: literal text for TextRuns, the captured Data for inline-code runs. The framework provides model.RenderRunsWithData for exactly this — the canonical rendering path the HTML, XML, and Markdown writers all use:

func (w *Writer) renderRuns(buf *strings.Builder, runs []model.Run) {
// RenderRunsWithData emits TextRun content verbatim and re-emits the
// captured Data for every inline-code run (Ph, PcOpen, PcClose, Sub).
buf.WriteString(model.RenderRunsWithData(runs))
}

This approach guarantees perfect roundtrip fidelity — the writer doesn't need to understand the markup format. It just replays whatever Data the reader stored. An <a href="/help" class="nav"> tag roundtrips as exactly that string, attributes and all.

Choosing Target vs Source Content

When writing output, the writer must choose the right content. Use the target runs if a translation exists for the configured locale, otherwise fall back to the source runs:

func (w *Writer) writeBlock(block *model.Block) {
if !w.Locale.IsEmpty() && block.HasTarget(w.Locale) {
// Write translated content (preserving inline codes)
w.renderRuns(buf, block.TargetRuns(w.Locale))
} else {
// Fall back to source
w.renderRuns(buf, block.SourceRuns())
}
}

Using Skeletons for Document Structure

Block-level structure that surrounds translatable content (opening and closing tags, whitespace, etc.) is captured in a Skeleton. The reader builds a skeleton with text parts and a reference to the block content:

block := model.NewRunsBlock("tu1", runs)
block.Skeleton = &model.Skeleton{
Strategy: model.SkeletonFragmentBased,
Parts: []model.SkeletonPart{
&model.SkeletonText{Text: "<p>"}, // Before content
&model.SkeletonRef{ResourceID: "tu1"}, // Content placeholder
&model.SkeletonText{Text: "</p>\n"}, // After content
},
}

The writer uses the skeleton to reconstruct the document:

if block.Skeleton != nil {
for _, sp := range block.Skeleton.Parts {
switch p := sp.(type) {
case *model.SkeletonText:
fmt.Fprint(w.Output, p.Text) // Emit structure verbatim
case *model.SkeletonRef:
// Emit the translated/source runs with inline codes
w.renderRuns(buf, runs)
}
}
}

Skeletons are critical for roundtrip fidelity of the block-level document structure. Without them, the writer would need to re-generate all surrounding tags, whitespace, and attributes — which risks losing information.


Run Metadata Fields

An inline-code run carries more than just the raw markup. These fields help tools, editors, and QA checks work with inline codes intelligently:

FieldPurposeExample
discriminatorWhich field is set: PcOpen, PcClose, or Pha PcOpen
TypeSemantic type for tool processing"fmt:bold", "fmt:link", "var"
IDMatches an opening run to its closing run"1" shared by the <b>/</b> pair
DataOriginal markup for roundtrip reconstruction"<a href=\"/help\">"
DispUI label in translation editors"[B]", "[/B]", "[IMG]"
EquivPlain text equivalent"\n" for <br>
ConstraintsEditing policy (Deletable/Cloneable/Reorderable)non-deletable for a {count} variable

Set these fields in the reader when you have the information. At minimum, set the discriminator, Type, ID, and Data. The other fields enhance the experience for translators and tools but are optional.


Configuration

DataFormatConfig requires four methods — FormatName(), Reset(), Validate(), and ApplyMap(values map[string]any) error. ApplyMap applies config values from a map and rejects unknown keys and type mismatches. The format.ApplyMapViaJSON helper (core/format/applymap.go) implements this for struct configs via JSON marshal/unmarshal with DisallowUnknownFields, while configs with complex parsing can hand-write a switch-based ApplyMap (see core/formats/json/config.go). The helper requires the struct's fields to carry matching yaml/json tags for the incoming keys.

type Config struct {
Encoding string `yaml:"encoding" json:"encoding"`
}

func (c *Config) FormatName() string { return "myformat" }
func (c *Config) Reset() { c.Encoding = "UTF-8" }
func (c *Config) Validate() error { return nil }

func (c *Config) ApplyMap(values map[string]any) error {
return format.ApplyMapViaJSON(c, values)
}

Registration

Add your format inside formats.RegisterAll() in core/formats/register.go. RegisterReader takes the format name, a reader factory, a FormatSignature for detection, and a display name; RegisterWriter takes the name and a writer factory:

// In RegisterAll(reg *registry.FormatRegistry, opts ...RegisterOptions):
reg.RegisterReader("myformat",
func() format.DataFormatReader { return myformat.NewReader() },
format.FormatSignature{
MIMETypes: []string{"application/x-myformat"},
Extensions: []string{".myf"},
}, "My Format")
reg.RegisterWriter("myformat", func() format.DataFormatWriter {
return myformat.NewWriter()
})

Testing

Extraction Tests

Verify that the reader correctly identifies translatable content and inline codes:

func TestReadInlineRuns(t *testing.T) {
ctx := context.Background()
reader := NewReader()
err := reader.Open(ctx, testutil.RawDocFromString(
`<html><body><p>Click <b>here</b> for info</p></body></html>`,
model.LocaleEnglish,
))
require.NoError(t, err)
defer reader.Close()

blocks := testutil.CollectBlocks(t, reader.Read(ctx))
require.GreaterOrEqual(t, len(blocks), 1)

// Plain text is the source runs with inline markup stripped.
assert.Equal(t, "Click here for info", blocks[0].SourceText())

// Inline codes are preserved as a PcOpen/PcClose pair on the source runs.
// (There is no Segment type — segmentation is an opt-in overlay, AD-002.)
runs := blocks[0].SourceRuns()
require.Len(t, runs, 4) // "Click ", <b>, "here", </b> + trailing text coalesces
require.NotNil(t, runs[1].PcOpen)
assert.Equal(t, "<b>", runs[1].PcOpen.Data)
require.NotNil(t, runs[3].PcClose)
assert.Equal(t, "</b>", runs[3].PcClose.Data)
assert.Equal(t, runs[1].PcOpen.ID, runs[3].PcClose.ID) // shared ID
}

Test each type of inline element your format supports:

func TestReadPlaceholderRun(t *testing.T) {
// Self-closing elements become Ph runs
reader := NewReader()
reader.Open(ctx, testutil.RawDocFromString(
`<html><body><p>Line one<br/>Line two</p></body></html>`,
model.LocaleEnglish,
))
defer reader.Close()

blocks := testutil.CollectBlocks(t, reader.Read(ctx))
runs := blocks[0].SourceRuns()

assert.Equal(t, "Line oneLine two", blocks[0].SourceText())
// The <br/> is a single Ph run between the two text runs.
require.NotNil(t, runs[1].Ph)
assert.Equal(t, "br", runs[1].Ph.Type)
}

func TestReadLinkRun(t *testing.T) {
// Links preserve href in PcOpen.Data
reader := NewReader()
reader.Open(ctx, testutil.RawDocFromString(
`<html><body><p>Visit <a href="http://example.com">our site</a></p></body></html>`,
model.LocaleEnglish,
))
defer reader.Close()

blocks := testutil.CollectBlocks(t, reader.Read(ctx))
runs := blocks[0].SourceRuns()

assert.Equal(t, "Visit our site", blocks[0].SourceText())
require.NotNil(t, runs[1].PcOpen)
assert.Contains(t, runs[1].PcOpen.Data, "href") // Attributes preserved
}

Roundtrip Tests

The gold standard: read a file, write it back, compare with the original. This proves that inline codes survive the full read-write cycle:

func TestRoundTrip(t *testing.T) {
original, err := os.ReadFile("testdata/sample.myf")
require.NoError(t, err)

ctx := context.Background()
reader := NewReader()
err = reader.Open(ctx, testutil.RawDocFromReader(
bytes.NewReader(original), "testdata/sample.myf", model.LocaleEnglish))
require.NoError(t, err)
parts := testutil.CollectParts(t, reader.Read(ctx))
reader.Close()

var buf bytes.Buffer
writer := NewWriter()
writer.SetOutputWriter(&buf)
writer.Write(ctx, testutil.PartsToChannel(parts))
writer.Close()

assert.Equal(t, string(original), buf.String())
}

Translation Roundtrip Tests

Verify that translated content with inline codes writes correctly:

func TestTranslationRoundTrip(t *testing.T) {
ctx := context.Background()
reader := NewReader()
reader.Open(ctx, testutil.RawDocFromString(
`<html><body><p>Click <b>here</b></p></body></html>`,
model.LocaleEnglish,
))
parts := testutil.CollectParts(t, reader.Read(ctx))
reader.Close()

// Build a translated run sequence with the same inline codes.
for _, p := range parts {
if p.Type == model.PartBlock {
block := p.Resource.(*model.Block)

block.SetTargetRuns(model.LocaleFrench, []model.Run{
{Text: &model.TextRun{Text: "Cliquez "}},
{PcOpen: &model.PcOpenRun{ID: "1", Type: "fmt:bold", Data: "<b>"}},
{Text: &model.TextRun{Text: "ici"}},
{PcClose: &model.PcCloseRun{ID: "1", Type: "fmt:bold", Data: "</b>"}},
})
}
}

var buf bytes.Buffer
writer := NewWriter()
writer.SetOutputWriter(&buf)
writer.SetLocale(model.LocaleFrench)
writer.Write(ctx, testutil.PartsToChannel(parts))

assert.Contains(t, buf.String(), "Cliquez <b>ici</b>")
}

See Testing for more patterns.


Inline Code Patterns by Format

Different formats map to the same Run model in different ways:

HTML / XML

Block-level elements (<p>, <div>, <h1>) are Block boundaries. Inline elements (<b>, <a>, <em>, <span>) become PcOpen/PcClose pairs. Void elements (<br>, <img>) become Ph runs.

Input: <p>Click <b>here</b> for <a href="/help">info</a></p>
Text: "Click here for info"
Runs: [text, PcOpen <b>, text, PcClose </b>, text, PcOpen <a href="/help">, text, PcClose </a>]

Markdown

Emphasis markers (*, **, `) become PcOpen/PcClose pairs. Links have the URL stored in the opening run's Data field.

Input: Click **here** for [info](/help)
Text: "Click here for info"
Runs: [text, PcOpen **, text, PcClose **, text, PcOpen [, text, PcClose ](/help)]

XLIFF / Translation Formats

XLIFF <pc> maps to a PcOpen/PcClose pair, <bpt>/<ept> (begin/end paired tag) likewise. <ph> (placeholder) and <it> (isolated tag) map to a Ph run. The original XLIFF inline markup goes into the run's Data.

Templating / Variables

Template variables like \{name\} or $\{count\} become Ph runs. The full variable expression goes into Data:

Input: Hello {name}, you have {count} items
Text: "Hello , you have items"
Runs: [text, Ph {name}, text, Ph {count}, text]

Formats Without Inline Codes

Formats like JSON, YAML, or .properties typically don't have inline markup. Use model.NewBlock(id, text) to create a block with a single plain TextRun. If these formats contain embedded HTML or Markdown, use nested Layers (see Architecture) to delegate inline handling to the appropriate sub-format reader.