AD-002: Content Model

Summary

Documents in neokapi are represented as a stream of Part values, each carrying a PartType discriminator and a Resource. Translatable content is a Block; a Block's content is a flat []Run per locale — a discriminated union (Text, Ph, PcOpen, PcClose, Sub, Plural, Select). Each inline-code run carries its metadata (native markup, abstract identity, semantic type, display text, text equivalent, editing constraints) directly on its fields, so format-aware tools can process content semantically while writers roundtrip native markup exactly. Interpretations of that content — sentence segmentation, terminology, entities, QA findings — are stand-off overlays: typed, run-anchored span sets layered over the runs on demand, never rewriting them. Segment is not a structural type; a segment is a span in a segmentation overlay. Targets are first-class records keyed by a variant — locale plus optional tone or channel — not bare locale-keyed strings.

Context

A localization content model must represent translatable documents in a way that is format-independent, type-safe, extensible, and able to represent recursive embedded content naturally. Go's composition and interface system (no class inheritance) shapes the design toward discriminated unions and explicit resource types rather than deep type hierarchies. Both the Part stream and the inline-content model are discriminated unions — one keyed by PartType, the other by which Run field is set.

Beyond structural representation, real-world localization workflows demand:

  • Stable content identity across extraction cycles for incremental processing.
  • Dynamic properties for extensible metadata.
  • Display hints that guide UI rendering without coupling the model to any particular frontend.
  • A format-independent inline code model that supports TM matching, AI translation, and editor rendering across all source formats.

The inline-code challenge

Documents contain inline formatting (bold, italic, links, images, variables, placeholders) embedded within translatable text, and every source format represents these constructs differently:

| Concept | HTML | Markdown | DOCX (OpenXML) | XLIFF 2.0 |
| --- | --- | --- | --- | --- |
| Bold | <b> | ** | <w:b/> | <pc type="fmt" subType="xlf:b"> |
| Link | <a href="…"> | [text](url) | <w:hyperlink> | <pc type="link"> |
| Line break | <br/> | two spaces + newline | <w:br/> | <ph type="fmt" subType="xlf:lb"/> |
| Placeholder | | | | <ph> |

A framework must make these constructs processable in a format-agnostic way — TM matching, AI translation, QA checks, and terminology lookup must not need to know whether the bold text came from HTML or Markdown. At the same time, perfect roundtrip fidelity to the original format is required: a <b class="emphasis"> must roundtrip as exactly that, not as a generic bold tag.

The embedded-content challenge

Documents frequently contain embedded content in a different format: HTML strings inside JSON values, HTML in CDATA sections of XML, Markdown in CSV columns. A JSON reader that only sees "<p>Hello <b>world</b></p>" as a flat string misses the inline formatting and produces inferior translation results.

Decision

Part and Resource

A single Part struct carries a PartType enum and a Resource:

type Part struct {
    Type     PartType
    Resource Resource // Block, Layer, Group, Data, or Media
}

type Resource interface {
    ResourceID() string
}

PartType values are PartTypeUnknown (the zero value), PartLayerStart, PartLayerEnd, PartGroupStart, PartGroupEnd, PartBlock, PartData, PartMedia, PartRawDocument, and PartCustom. Constants carry explicit integer values for wire compatibility, so a few slots are reserved for retired batch types and are not renumbered.

Resource types:

  • Layer — structural grouping (document, section, embedded content), delimited by PartLayerStart/PartLayerEnd.
  • Group (GroupStart/GroupEnd) — a nested structural group within a layer (e.g. a table), delimited by PartGroupStart/PartGroupEnd.
  • Block — translatable content: a source run sequence and per-locale target run sequences, plus optional stand-off overlays.
  • Data — non-translatable structure (skeleton, metadata).
  • Media — binary content (images, embedded files).

PartResult{Part, Error} carries both content and errors on the same channel, letting tools decide how to handle errors (skip, retry, fail) without maintaining separate error channels.
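As a minimal sketch of how a tool might consume this stream, the example below dispatches on Part.Type and applies a "skip on error" policy to PartResult values. The integer values of the constants, the pared-down Block and Data structs, and the collectBlockIDs and demoStream helpers are assumptions made for illustration; they are not the framework's definitions.

```go
package main

import (
    "errors"
    "fmt"
)

// PartType is the discriminator. These values are illustrative only; the real
// constants carry explicit wire values that this sketch does not reproduce.
type PartType int

const (
    PartTypeUnknown PartType = iota
    PartLayerStart
    PartLayerEnd
    PartBlock
    PartData
)

type Resource interface {
    ResourceID() string
}

// Block and Data are pared down to an ID for the sake of the example.
type Block struct{ ID string }

func (b *Block) ResourceID() string { return b.ID }

type Data struct{ ID string }

func (d *Data) ResourceID() string { return d.ID }

type Part struct {
    Type     PartType
    Resource Resource
}

type PartResult struct {
    Part  Part
    Error error
}

// collectBlockIDs drains the stream, records the ID of every Block, counts
// errored results as skipped, and lets all other part types pass by.
func collectBlockIDs(in <-chan PartResult) (ids []string, skipped int) {
    for res := range in {
        if res.Error != nil {
            skipped++ // policy choice: skip; a tool could retry or fail instead
            continue
        }
        switch res.Part.Type {
        case PartBlock:
            if b, ok := res.Part.Resource.(*Block); ok {
                ids = append(ids, b.ID)
            }
        default:
            // Layers, groups, data, and media are ignored by this tool.
        }
    }
    return ids, skipped
}

// demoStream builds a small, closed stream for the example.
func demoStream() <-chan PartResult {
    out := make(chan PartResult, 5)
    out <- PartResult{Part: Part{Type: PartLayerStart}}
    out <- PartResult{Part: Part{Type: PartBlock, Resource: &Block{ID: "b1"}}}
    out <- PartResult{Error: errors.New("unreadable part")}
    out <- PartResult{Part: Part{Type: PartData, Resource: &Data{ID: "d1"}}}
    out <- PartResult{Part: Part{Type: PartBlock, Resource: &Block{ID: "b2"}}}
    close(out)
    return out
}

func main() {
    ids, skipped := collectBlockIDs(demoStream())
    fmt.Println(ids, skipped)
}
```

Because the error travels in the same value as the part, the skip decision sits next to the type switch and needs no second channel.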

Block

type Block struct {
    ID           string
    SourceLocale LocaleID
    Source       []Run                  // whole source content
    Targets      map[VariantKey]*Target // first-class targets, keyed by variant
    Overlays     []Overlay              // positional, run-anchored stand-off layers
    Annotations  map[string]any         // block-scoped typed metadata, keyed by type
    Identity     *BlockIdentity         // content-addressable hash for dedup
    Properties   map[string]string      // opaque pass-through metadata only
    // …skeleton link, display hint, whitespace flag, etc.
}

A Block holds one source run sequence and one target run sequence per variant: the whole content, unsegmented. There is no Segment container: most blocks are a single string and its translations, and the model says exactly that. When a workflow needs sentence boundaries (review UI, exact-match TM keys, XLIFF/TMX export), a flow tool computes them and attaches a segmentation overlay (see Stand-off overlays). The overlay is layered over the runs; the runs are never repartitioned, so segmentation is reversible by construction: dropping the overlay restores the unsegmented content with no loss.

Targets and variants

A target is a first-class record, not a bare run sequence, and it is keyed by a variant rather than a bare locale:

// VariantKey identifies a target variant. Locale is the only required
// dimension; tone, channel, and any future axis are optional and zero-valued
// by default, so the common case carries no extra ceremony.
type VariantKey struct {
    Locale  LocaleID // required
    Tone    string   // optional ("" = unspecified)
    Channel string   // optional
}

// Target is the committed translation for one variant: content plus its
// lifecycle and provenance. Candidate/alternative translations (TM, MT, AI
// proposals) remain `alt-translation` annotations; Target is the chosen one.
type Target struct {
    Runs   []Run
    Status TargetStatus // draft | translated | reviewed | signed-off
    Origin Origin       // human | tm | mt | ai, plus engine, tool, timestamp, author
    Score  float64
}

Ergonomic accessors keep the locale-only path a one-liner: block.Target("fr-FR") resolves VariantKey{Locale: "fr-FR"}, while block.TargetVariant(key) reaches the general case. Code that only knows about locales never has to mention tone or channel — richer variants are strictly opt-in. A Target's Runs carry their own overlays (target-side segmentation, target terms), scoped to that variant.

This separates two things the older map[LocaleID][]Run conflated: the committed translation per variant (a Target, with status and provenance) and candidate proposals (alt-translation annotations, possibly many per variant, each scored). The in-flight Target holds the current committed translation; accumulated history across runs and review trails are a persistence-layer concern, outside the content model.
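The locale-only and general accessors described above could look roughly like this. The return type (a *Target that is nil when absent), the reduced Target with a Text field in place of Runs, and the demoBlock helper are assumptions for the sketch; the text only names block.Target and block.TargetVariant.

```go
package main

import "fmt"

type LocaleID string

type VariantKey struct {
    Locale  LocaleID
    Tone    string
    Channel string
}

// Target is reduced to a text field and a status; the real record holds
// Runs, Origin, and Score as well.
type Target struct {
    Text   string
    Status string
}

type Block struct {
    Targets map[VariantKey]*Target
}

// TargetVariant returns the target stored under the exact key, or nil.
func (b *Block) TargetVariant(key VariantKey) *Target {
    return b.Targets[key]
}

// Target is the locale-only shortcut: tone and channel stay zero-valued.
func (b *Block) Target(locale LocaleID) *Target {
    return b.TargetVariant(VariantKey{Locale: locale})
}

// demoBlock holds a plain French target and a formal-tone variant.
func demoBlock() *Block {
    return &Block{Targets: map[VariantKey]*Target{
        VariantKey{Locale: "fr-FR"}:                 &Target{Text: "Bonjour", Status: "reviewed"},
        VariantKey{Locale: "fr-FR", Tone: "formal"}: &Target{Text: "Bonjour, Madame", Status: "draft"},
    }}
}

func main() {
    b := demoBlock()
    fmt.Println(b.Target("fr-FR").Text)
    fmt.Println(b.TargetVariant(VariantKey{Locale: "fr-FR", Tone: "formal"}).Text)
    fmt.Println(b.Target("de-DE") == nil)
}
```

Since VariantKey is a comparable struct, it works directly as a map key, and the zero values of Tone and Channel make the locale-only key well defined.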

Content-addressable identity

type BlockIdentity struct {
    ContentHash string // SHA-256 of normalized source text
    ContextHash string // SHA-256 of contextual information (name, type, properties)
}

The ContentHash is computed from normalized source text (whitespace-normalized). Combined with ContextHash — a SHA-256 over the block's name, type, and sorted properties — this produces stable identity across extraction cycles: the same content always produces the same identity, so only blocks whose identity has changed need reprocessing. Identical blocks across documents share the same ContentHash, letting translation memory and AI tools avoid redundant work.
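One way to compute the two hashes is sketched below. The exact normalization (collapsing whitespace runs to single spaces), the NUL and "=" separators in the context string, and the hex encoding are assumptions; the text only fixes SHA-256, whitespace normalization, and sorted properties.

```go
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"
    "strings"
)

type BlockIdentity struct {
    ContentHash string
    ContextHash string
}

// hashHex returns the hex-encoded SHA-256 digest of s.
func hashHex(s string) string {
    sum := sha256.Sum256([]byte(s))
    return hex.EncodeToString(sum[:])
}

// normalize collapses every whitespace run to one space and trims the ends.
// This is an assumed normalization, chosen for the sketch.
func normalize(s string) string {
    return strings.Join(strings.Fields(s), " ")
}

// computeIdentity hashes the normalized source text and, separately, the
// block's name, type, and properties in sorted key order.
func computeIdentity(sourceText, name, typ string, props map[string]string) BlockIdentity {
    keys := make([]string, 0, len(props))
    for k := range props {
        keys = append(keys, k)
    }
    sort.Strings(keys) // map iteration order is random; sorting keeps the hash stable

    var ctx strings.Builder
    ctx.WriteString(name)
    ctx.WriteByte(0)
    ctx.WriteString(typ)
    for _, k := range keys {
        ctx.WriteByte(0)
        ctx.WriteString(k)
        ctx.WriteByte('=')
        ctx.WriteString(props[k])
    }
    return BlockIdentity{
        ContentHash: hashHex(normalize(sourceText)),
        ContextHash: hashHex(ctx.String()),
    }
}

func main() {
    a := computeIdentity("Hello   world\n", "title", "text", map[string]string{"k": "v"})
    b := computeIdentity(" Hello world", "title", "text", map[string]string{"k": "v"})
    fmt.Println(a == b) // whitespace-only differences do not change identity
}
```

A block whose text is unchanged but whose name differs keeps its ContentHash and gets a new ContextHash, which is what lets TM reuse the content while incremental extraction still notices the move.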

Block identity also carries a separate project-unique internal ID tracked by the store layer — see AD-003: Identity for the dual-ID scheme.

Dynamic properties

The Properties map carries arbitrary key-value metadata that tools and connectors attach as blocks flow through the pipeline. Examples:

  • "translation-origin": "tm" (how the translation was produced)
  • "word-count": "42" (count from the wordcount tool, stored as a string because the map is map[string]string)
  • "cms-path": "/en/blog/post-1" (source location in a CMS)

Properties are serialized and carried through the pipeline. Tools add metadata without content-model changes. This replaces the pattern of adding dedicated fields for every new piece of metadata.

Overlays and Annotations (two stand-off carriers)

Typed stand-off interpretations of a Block come in two kinds, kept in two fields because they differ in shape and lifecycle:

  • Overlays are positional: run-anchored span sets that point into the content — segmentation, terminology, entities, term candidates, QA findings, source↔target alignment. Each Span carries a run Range (the position) and an optional typed payload Value. Because their ranges anchor to runs, a source rewrite shifts them. A transformer's edit plan is a structured span→replacement map, so the framework applier rebases the survivors onto the new runs with model.RemapOverlays (spans overlapping an edit are dropped; the rest shift to follow it); an opaque whole-block rewrite has no mapping and drops them. See AD-006.
  • Annotations are block-scoped: a keyed map of typed payloads describing the block as a whole, with no position — alt-translations, notes, analysis results (word/char/segment counts, comparison, repetition, brand-voice), and format round-trip state. A source rewrite does not invalidate them.

type Block struct {
    // …
    Overlays    []Overlay      // positional, run-anchored
    Annotations map[string]any // block-scoped, keyed by type
}

type Overlay struct {
    Type    OverlayType // "segmentation","term","entity","qa","alignment",…
    Variant *VariantKey // nil = source side
    Layer   string      // segmentation granularity; LayerPrimary = primary
    Spans   []Span      // each with a run Range and a typed Value
}

Whether an interpretation is positional is structural — it is an Overlay — not a runtime flag. Annotations are reached through the Anno/SetAnno/DelAnno/AnnoMap helpers (keyed by type); overlays through OverlayOf/AddOverlaySpan/OverlaySpan/RemoveOverlay.

| Interpretation | Carrier / type | Producer | Purpose |
| --- | --- | --- | --- |
| Segmentation | overlay segmentation | segment annotator | Sentence / chunk boundaries over runs |
| Terminology | overlay term | term-lookup | Matched terminology with target terms |
| Term candidates | overlay term-candidate | ai-terminology | Term extraction candidates from an LLM |
| Entities | overlay entity | ai-entity-extract | Named entities (people, places, dates) |
| QA findings | overlay qa | qa-check | Quality findings with severity |
| Alignment | overlay alignment | aligner, readers | Source-span ↔ target-span links |
| Alt-translations | annotation alt-translation | TM leverage, AI tools | Candidate translations with scores |

Both overlay span values and annotation values are typed payloads registered with a single payload registry (RegisterPayload / NewPayload) so the wire and store layers can rehydrate the typed value from its type name.
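A payload registry of this kind can be sketched as a map from type name to factory. The signatures of RegisterPayload and NewPayload shown here, the QAFinding payload and its fields, the JSON encoding, and the rehydrate helper are all assumptions for illustration; the text names only the two registry functions.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// QAFinding is an example payload; its fields are invented for this sketch.
type QAFinding struct {
    Severity string `json:"severity"`
    Message  string `json:"message"`
}

// payloadFactories maps a type name to a constructor for an empty payload.
var payloadFactories = map[string]func() any{}

// RegisterPayload records how to construct the payload for typeName.
func RegisterPayload(typeName string, factory func() any) {
    payloadFactories[typeName] = factory
}

// NewPayload returns a fresh payload for typeName, or false if unregistered.
func NewPayload(typeName string) (any, bool) {
    f, ok := payloadFactories[typeName]
    if !ok {
        return nil, false
    }
    return f(), true
}

// rehydrate rebuilds a typed value from its type name and serialized bytes,
// which is the job a wire or store layer would need the registry for.
func rehydrate(typeName string, raw []byte) (any, error) {
    v, ok := NewPayload(typeName)
    if !ok {
        return nil, fmt.Errorf("unknown payload type %q", typeName)
    }
    if err := json.Unmarshal(raw, v); err != nil {
        return nil, err
    }
    return v, nil
}

func init() {
    // Factories return pointers so json.Unmarshal can fill them in.
    RegisterPayload("qa", func() any { return &QAFinding{} })
}

func main() {
    v, err := rehydrate("qa", []byte(`{"severity":"error","message":"missing tag"}`))
    fmt.Printf("%+v %v\n", v, err)
}
```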

Properties is opaque pass-through metadata only (connector keys, format hints). Analytic/interpretive results that a tool produces — TM match scores, word counts, QA findings — are overlays or annotations, not properties; the IO contract (AD-006) declares which a tool consumes and produces.

Tools communicate by reading the overlays and annotations produced upstream and writing their own downstream, keeping tools loosely coupled through the shared data model rather than direct dependencies.

The Run sequence

A Block's content is not a string with embedded markers; it is a flat sequence of Run values, held directly on the Block (Source, and each Targets[variant].Runs).

Run is a discriminated union: exactly one of its pointer fields is set, which is the run's kind. Run.Kind() returns the discriminator and Run.RunID() returns the id for the kinds that carry one.

type Run struct {
    Text    *TextRun        // plain text chunk
    Ph      *PlaceholderRun // self-closing: variable, icon, <br>, redaction
    PcOpen  *PcOpenRun      // opening half of a paired code (<a>, <b>, …)
    PcClose *PcCloseRun     // closing half of a paired code (</a>, </b>, …)
    Sub     *SubRun         // reference to a nested Block (subfilter output)
    Plural  *PluralRun      // ICU plural with per-form Runs
    Select  *SelectRun      // ICU select with per-case Runs
}

Text and inline codes interleave positionally in the slice; there is no parallel side-table and no marker characters. A reader builds the slice by appending a TextRun for each text chunk and an inline-code run for each construct it encounters (see core/formats/*/run_builder.go).
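To make the "exactly one field set" rule concrete, here is a small sketch of Kind() and RunID() over a cut-down Run with four of the seven kinds. The string return type of Kind(), the kind names, and the buildExample helper are assumptions; the text states only that the two methods exist.

```go
package main

import "fmt"

type TextRun struct{ Text string }
type PlaceholderRun struct{ ID, Type, Data string }
type PcOpenRun struct{ ID, Type, Data string }
type PcCloseRun struct{ ID, Data string }

// Run is cut down to four of the seven kinds for brevity.
type Run struct {
    Text    *TextRun
    Ph      *PlaceholderRun
    PcOpen  *PcOpenRun
    PcClose *PcCloseRun
}

// Kind reports which pointer field is set.
func (r Run) Kind() string {
    switch {
    case r.Text != nil:
        return "text"
    case r.Ph != nil:
        return "ph"
    case r.PcOpen != nil:
        return "pc-open"
    case r.PcClose != nil:
        return "pc-close"
    }
    return "unknown"
}

// RunID returns the id for kinds that carry one, and "" for text.
func (r Run) RunID() string {
    switch {
    case r.Ph != nil:
        return r.Ph.ID
    case r.PcOpen != nil:
        return r.PcOpen.ID
    case r.PcClose != nil:
        return r.PcClose.ID
    }
    return ""
}

// buildExample appends runs in document order, as a reader would for
// "Line one<br/>Line two".
func buildExample() []Run {
    var runs []Run
    runs = append(runs, Run{Text: &TextRun{Text: "Line one"}})
    runs = append(runs, Run{Ph: &PlaceholderRun{ID: "1", Type: "struct:break", Data: "<br/>"}})
    runs = append(runs, Run{Text: &TextRun{Text: "Line two"}})
    return runs
}

func main() {
    for _, r := range buildExample() {
        fmt.Printf("%s id=%q\n", r.Kind(), r.RunID())
    }
}
```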

Stand-off overlays (segmentation, terminology, entities)

The runs are the content. Everything that interprets the content — where the sentence boundaries fall, which spans are terms or named entities, what a QA check flagged, how a source span aligns to a target span — is a stand-off overlay: a typed set of spans anchored to a run sequence, layered over it without rewriting it. This is the one mechanism for every positional interpretation; segmentation is simply the overlay of type segmentation.

// Overlay is a typed stand-off layer over one side of a Block.
type Overlay struct {
    Type    OverlayType // "segmentation" | "term" | "entity" | "qa" | "alignment" | …
    Variant *VariantKey // which run sequence it annotates; nil = source
    Layer   string      // segmentation granularity; LayerPrimary = primary sentence layer
    Spans   []Span
}

type Span struct {
    ID    string            // overlay-local id (e.g. a segment id "s1")
    Range RunRange          // run-anchored, never a flattened-string offset
    Props map[string]string // type-specific: ignorable marker, term/entity payload, score, alignment ref
}

// RunRange anchors a span on the run sequence (start and end run indices plus
// an intra-text-run character offset), so boundaries stay stable across inline
// codes and survive run-preserving edits.
type RunRange struct {
    StartRun, StartOffset int
    EndRun, EndOffset     int
}

Four properties follow from anchoring interpretations to runs rather than baking them into structure:

  • Segmentation is opt-in and dynamic. The segmentation flow tool computes boundaries and writes a segmentation overlay; nothing runs it by default. Whole-block content is the norm, which is also what document-level AI translation wants for coherence. The tool delegates to a pluggable segmenter engine selected by an engine: config field, drawn from a registry (core/segment) that mirrors the AI/MT provider registries:

    • srx (default) — a faithful SRX 2.0 rule engine (cascading break/no-break rules with language maps, lookaround via dlclark/regexp2, and SourceSrxPath/TargetSrxPath overrides). It is adaptive, reproducing Okapi's own model: Okapi's defaultSegmentation.srx declares useIcu4jBreakRules="yes" — ICU does the breaking and the ~2,800 SRX rules act as no-break exceptions. So where a UAX-29 base breaker is linked (cgo/ICU), the srx engine loads Okapi's full 14-language ruleset and runs that ICU-base + SRX-exception hybrid, verified byte-for-byte against the real Okapi SRXSegmenter (TestSRXParityWithOkapi, golden via make regen-srx-parity-golden). Where no base breaker is linked (nocgo/WASM/browser), it falls back to a reduced, self-contained pure-Go ruleset with explicit break rules — the only segmenter that runs in the browser. The base breaker is resolved at runtime through a core/segment registry (the cgo uax29 package registers one), so the pure-Go srx package never imports cgo;
    • uax29 — the bare ICU Unicode sentence baseline (no exception rules), where cgo/ICU is linked;
    • llm — semantic chunking via an AI provider, writing the llm-chunk layer;
    • sat — the wtpsplit Segment any Text ML model, run in-process inside the out-of-process kapi-sat plugin so its native ONNX stack stays out of the portable binary.

    Engines not linked into a binary are simply absent — selecting one reports a clear error rather than failing the build. Several segmentations can coexist (e.g. sentence and a token-budgeted llm-chunk), each its own overlay, distinguished by the overlay's Layer.

  • It is reversible by construction. Desegmentation is "drop the overlay." There is no inverse operation to get wrong and no inter-segment "ignorable" material to lose — the gaps between segment spans are simply runs no span covers. By default the engines trim leading/trailing whitespace from each segment (matching Okapi's defaultSegmentation.srx, and keeping TM keys stable), so the inter-sentence whitespace is exactly such an uncovered gap rather than being attached to either side of the break. The SRX engine reads this from the ruleset header (okpsrx:options trimLeadingWhitespaces / trimTrailingWhitespaces); the segmentation tool defaults it on for every engine. Trimming is opt-out (trim*Whitespace: false) where a caller wants the raw runs.

  • It is uniform. Terminology, entities, and QA findings are the same kind of overlay, anchored the same run-aware way, rather than each re-detecting boundaries at render time.

  • Leverage is hybrid. TM matching works at the whole-block level (including embedding/semantic similarity) and, when a segmentation overlay is present, also computes exact and structural keys per segment span via the Run projections below.
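The properties above can be illustrated with a small sketch that reads segment text through a RunRange and then desegments by dropping the overlay. It assumes EndRun is inclusive, EndOffset is exclusive, and offsets are byte offsets into the TextRun (adequate for the ASCII demo); the text does not pin these conventions down. The spanText, dropOverlays, and demo helpers are hypothetical.

```go
package main

import (
    "fmt"
    "strings"
)

type TextRun struct{ Text string }
type PcOpenRun struct{ ID string }
type PcCloseRun struct{ ID string }

type Run struct {
    Text    *TextRun
    PcOpen  *PcOpenRun
    PcClose *PcCloseRun
}

type RunRange struct {
    StartRun, StartOffset int
    EndRun, EndOffset     int
}

type Span struct {
    ID    string
    Range RunRange
}

type Overlay struct {
    Type  string
    Spans []Span
}

// spanText returns the plain text a range covers. Inline codes inside the
// range contribute nothing. Offsets are assumed valid for their runs.
func spanText(runs []Run, r RunRange) string {
    var sb strings.Builder
    for i := r.StartRun; i <= r.EndRun && i < len(runs); i++ {
        t := runs[i].Text
        if t == nil {
            continue
        }
        lo, hi := 0, len(t.Text)
        if i == r.StartRun {
            lo = r.StartOffset
        }
        if i == r.EndRun {
            hi = r.EndOffset
        }
        sb.WriteString(t.Text[lo:hi])
    }
    return sb.String()
}

// dropOverlays removes every overlay of the given type; the runs are untouched.
func dropOverlays(overlays []Overlay, typ string) []Overlay {
    var kept []Overlay
    for _, o := range overlays {
        if o.Type != typ {
            kept = append(kept, o)
        }
    }
    return kept
}

// demo models "Hello <b>world</b>. Next sentence." with two segment spans.
// The space after the first full stop is covered by neither span.
func demo() ([]Run, Overlay) {
    runs := []Run{
        {Text: &TextRun{Text: "Hello "}},
        {PcOpen: &PcOpenRun{ID: "1"}},
        {Text: &TextRun{Text: "world"}},
        {PcClose: &PcCloseRun{ID: "1"}},
        {Text: &TextRun{Text: ". Next sentence."}},
    }
    seg := Overlay{Type: "segmentation", Spans: []Span{
        {ID: "s1", Range: RunRange{StartRun: 0, StartOffset: 0, EndRun: 4, EndOffset: 1}},
        {ID: "s2", Range: RunRange{StartRun: 4, StartOffset: 2, EndRun: 4, EndOffset: 16}},
    }}
    return runs, seg
}

func main() {
    runs, seg := demo()
    for _, s := range seg.Spans {
        fmt.Printf("%s: %q\n", s.ID, spanText(runs, s.Range))
    }
    fmt.Println(len(dropOverlays([]Overlay{seg}, "segmentation")), len(runs))
}
```

The first span crosses the paired code without splitting it, and the uncovered space between the spans is the trimmed inter-sentence whitespace described above.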

Inline-code runs carry the former span "layers"

The earlier content model spread inline-code metadata across six "span layers." Those layers now live directly on the inline-code run structs. PlaceholderRun and PcOpenRun carry the full set:

type PcOpenRun struct {
    ID          string          // abstract identity; shared with the matching PcClose
    Type        string          // semantic type from the vocabulary ("fmt:bold")
    SubType     string          // format-specific refinement ("html:b", "xlf:b")
    Data        string          // native markup, replayed verbatim by writers ("<b class='x'>")
    Equiv       string          // plain-text equivalent ("" for bold, "\n" for <br>)
    Disp        string          // translator-facing label ("[B]")
    Constraints *RunConstraints // editing constraints
}

type RunConstraints struct {
    Deletable   bool
    Cloneable   bool
    Reorderable bool
}

PlaceholderRun has the same shape (it is self-closing, so it has no pairing partner). PcCloseRun is the closing half of a paired code and is leaner — it shares ID with its PcOpen and replays its own Data, but it has no Constraints field because the closing half inherits its opener's behavior. The six concerns map onto these fields as follows:

| Former span "layer" | Run field |
| --- | --- |
| Abstract identity | ID (+ Kind()) |
| Semantic type | Type, SubType |
| Native markup | Data |
| Display text | Disp |
| Text equivalent | Equiv |
| Editing constraints | Constraints |

SubRun references a subblock produced by a subfilter (ID, Ref, Equiv); PluralRun / SelectRun are structured ICU constructs whose branches are themselves []Run.

Semantic type vocabulary

The Type field uses a defined vocabulary of format-independent semantic types, grouped into categories by namespace prefix:

Formatting (fmt:):

| Type | Meaning | HTML | Markdown | DOCX |
| --- | --- | --- | --- | --- |
| fmt:bold | Bold text | <b>, <strong> | ** | <w:b/> |
| fmt:italic | Italic text | <i>, <em> | *, _ | <w:i/> |
| fmt:underline | Underlined text | <u> | | <w:u/> |
| fmt:strikethrough | Struck-through text | <s>, <del> | ~~ | <w:strike/> |
| fmt:subscript | Subscript text | <sub> | | <w:vertAlign w:val="subscript"/> |
| fmt:superscript | Superscript text | <sup> | | <w:vertAlign w:val="superscript"/> |
| fmt:code | Inline code | <code> | ` | |
| fmt:highlight | Highlighted text | <mark> | | <w:highlight/> |

Linking (link:):

| Type | Meaning | HTML | Markdown |
| --- | --- | --- | --- |
| link:hyperlink | Hyperlink | <a href="…"> | [text](url) |
| link:crossref | Cross-reference | <a href="#id"> | [text](#anchor) |
| link:email | Email link | <a href="mailto:…"> | |

Media (media:):

| Type | Meaning | HTML | Markdown |
| --- | --- | --- | --- |
| media:image | Inline image | <img> | ![alt](url) |
| media:video | Inline video | <video> | |
| media:audio | Inline audio | <audio> | |

Structure (struct:):

| Type | Meaning | HTML | Markdown |
| --- | --- | --- | --- |
| struct:break | Line break | <br> | two spaces + \n |
| struct:pagebreak | Page break | | |
| struct:footnote | Footnote reference | | [^id] |
| struct:ruby | Ruby annotation | <ruby> | |

Code (code:) — non-translatable inline tokens:

| Type | Meaning | Examples |
| --- | --- | --- |
| code:variable | Named variable | {name}, $name, %s |
| code:placeholder | Positional placeholder | {0}, %1$s |
| code:function | ICU function | {count, plural, …} |
| code:markup | Generic preserved markup | arbitrary format-specific tags |

Entity (entity:), also used by entity overlays:

| Type | Meaning |
| --- | --- |
| entity:person | Person name |
| entity:organization | Organization name |
| entity:product | Product name |
| entity:location | Place name |
| entity:date | Date value |
| entity:time | Time value |
| entity:currency | Currency amount |
| entity:measurement | Measurement value |

Format-specific refinement via SubType

The SubType field provides format-specific refinement using a prefix convention. Reserved prefixes:

  • xlf: — XLIFF 2.0 subtypes (xlf:b, xlf:i, xlf:u, xlf:lb, xlf:pb, xlf:var)
  • html: — HTML element names (html:span, html:div, html:em)
  • md: — Markdown constructs (md:emphasis, md:strong)
  • docx: — OpenXML run properties (docx:w:b, docx:w:i)

Custom subtypes use a reverse-domain prefix: com.acme:custom-tag.

Run ID assignment and pairing

Format readers assign sequential numeric IDs to inline-code runs within each run sequence. A PcOpenRun and the PcCloseRun that closes it share the same ID; a PlaceholderRun gets its own. IDs start at "1". Pairs nest LIFO, and runs inside a Plural/Select branch form their own scope:

Input: Click <b>here</b> for <a href="x">info</a>.
Runs:  TextRun "Click "
       PcOpen{ID:"1", Type:"fmt:bold"} TextRun "here" PcClose{ID:"1"}
       TextRun " for "
       PcOpen{ID:"2", Type:"link:hyperlink"} TextRun "info" PcClose{ID:"2"}
       TextRun "."

This produces stable structural keys for TM matching: HTML <b>Click</b> and Markdown **Click** both yield {1}Click{/1} — TM entries created from HTML match Markdown sources at the structural tier.
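The assignment rule (sequential IDs, shared by an open/close pair, LIFO nesting) can be sketched with a stack of open IDs. The runBuilder type and its methods are hypothetical helpers, not the readers' actual run_builder code, and the separate ID scope of Plural/Select branches is left out.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

type TextRun struct{ Text string }
type PlaceholderRun struct{ ID, Type string }
type PcOpenRun struct{ ID, Type string }
type PcCloseRun struct{ ID string }

type Run struct {
    Text    *TextRun
    Ph      *PlaceholderRun
    PcOpen  *PcOpenRun
    PcClose *PcCloseRun
}

// runBuilder hands out sequential IDs and pairs closes with opens.
type runBuilder struct {
    runs []Run
    next int      // last ID handed out; the first ID is "1"
    open []string // stack of IDs of paired codes not yet closed
}

func (b *runBuilder) newID() string {
    b.next++
    return strconv.Itoa(b.next)
}

func (b *runBuilder) Text(s string) {
    b.runs = append(b.runs, Run{Text: &TextRun{Text: s}})
}

// Open emits a PcOpen with a fresh ID and remembers it on the stack.
func (b *runBuilder) Open(typ string) {
    id := b.newID()
    b.open = append(b.open, id)
    b.runs = append(b.runs, Run{PcOpen: &PcOpenRun{ID: id, Type: typ}})
}

// Close emits a PcClose for the most recently opened code (LIFO).
// It reports false when nothing is open.
func (b *runBuilder) Close() bool {
    if len(b.open) == 0 {
        return false
    }
    id := b.open[len(b.open)-1]
    b.open = b.open[:len(b.open)-1]
    b.runs = append(b.runs, Run{PcClose: &PcCloseRun{ID: id}})
    return true
}

// Placeholder emits a self-closing run with its own ID.
func (b *runBuilder) Placeholder(typ string) {
    b.runs = append(b.runs, Run{Ph: &PlaceholderRun{ID: b.newID(), Type: typ}})
}

// codeTrail lists inline-code IDs in order, with "/" marking a close.
func codeTrail(runs []Run) string {
    var trail []string
    for _, r := range runs {
        switch {
        case r.PcOpen != nil:
            trail = append(trail, r.PcOpen.ID)
        case r.PcClose != nil:
            trail = append(trail, "/"+r.PcClose.ID)
        case r.Ph != nil:
            trail = append(trail, r.Ph.ID)
        }
    }
    return strings.Join(trail, " ")
}

// buildNested models <b>bold <i>both</i></b><br/>.
func buildNested() []Run {
    b := &runBuilder{}
    b.Open("fmt:bold")
    b.Text("bold ")
    b.Open("fmt:italic")
    b.Text("both")
    b.Close()
    b.Close()
    b.Placeholder("struct:break")
    return b.runs
}

func main() {
    fmt.Println(codeTrail(buildNested())) // prints "1 2 /2 /1 3"
}
```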

Run text projections

A Run sequence is the single source of truth; every textual form that crosses a boundary is a projection computed from []Run on demand. The framework provides (in core/model/):

// Plain flattening — TextRun content verbatim, placeholders contribute
// {equiv}, paired codes contribute their inner content, plural/select take
// the 'other' branch. Use: word count, search, QA text comparison. A
// text-only variant (inline codes contribute nothing) backs plain-text views.
func FlattenRuns(runs []Run) string

// Structural text — inline-code runs become numbered placeholders ({1},
// {/1}, {2/}). Use: TM exact matching (structural tier).
func RunsStructuralText(runs []Run) string

// Generalized text — entity Ph runs become typed placeholders ({PERSON}),
// other inline codes become numbered. Use: TM generalized matching.
func RunsGeneralizedText(runs []Run) string

// Markup-preserving render — re-emits each run's captured Data verbatim.
// Use: HTML/XML/Markdown writers splicing opaque markup back into a string.
func RenderRunsWithData(runs []Run) string

For example, the Run sequence TextRun "Click ", PcOpen{ID:"1", Type:"fmt:bold", Data:"<b class='x'>"}, TextRun "here", PcClose{ID:"1", Data:"</b>"}, TextRun " for info" projects as follows:

FlattenRuns(): "Click here for info"
RunsStructuralText(): "Click {1}here{/1} for info"
RenderRunsWithData(): "Click <b class='x'>here</b> for info"
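The three outputs above can be reproduced with simplified stand-ins for the projections. The functions below (flatten, structural, renderWithData) are illustrative re-implementations over a reduced Run type, not the core/model code; Sub, Plural, and Select are omitted.

```go
package main

import (
    "fmt"
    "strings"
)

type TextRun struct{ Text string }
type PlaceholderRun struct{ ID, Data, Equiv string }
type PcOpenRun struct{ ID, Type, Data string }
type PcCloseRun struct{ ID, Data string }

type Run struct {
    Text    *TextRun
    Ph      *PlaceholderRun
    PcOpen  *PcOpenRun
    PcClose *PcCloseRun
}

// flatten keeps text, substitutes a placeholder's Equiv, and drops paired
// codes so that only their inner content remains.
func flatten(runs []Run) string {
    var sb strings.Builder
    for _, r := range runs {
        switch {
        case r.Text != nil:
            sb.WriteString(r.Text.Text)
        case r.Ph != nil:
            sb.WriteString(r.Ph.Equiv)
        }
    }
    return sb.String()
}

// structural replaces inline codes with numbered markers: {1}, {/1}, {2/}.
func structural(runs []Run) string {
    var sb strings.Builder
    for _, r := range runs {
        switch {
        case r.Text != nil:
            sb.WriteString(r.Text.Text)
        case r.Ph != nil:
            sb.WriteString("{" + r.Ph.ID + "/}")
        case r.PcOpen != nil:
            sb.WriteString("{" + r.PcOpen.ID + "}")
        case r.PcClose != nil:
            sb.WriteString("{/" + r.PcClose.ID + "}")
        }
    }
    return sb.String()
}

// renderWithData replays each run's captured native markup verbatim.
func renderWithData(runs []Run) string {
    var sb strings.Builder
    for _, r := range runs {
        switch {
        case r.Text != nil:
            sb.WriteString(r.Text.Text)
        case r.Ph != nil:
            sb.WriteString(r.Ph.Data)
        case r.PcOpen != nil:
            sb.WriteString(r.PcOpen.Data)
        case r.PcClose != nil:
            sb.WriteString(r.PcClose.Data)
        }
    }
    return sb.String()
}

// example is the run sequence used in the text.
func example() []Run {
    return []Run{
        {Text: &TextRun{Text: "Click "}},
        {PcOpen: &PcOpenRun{ID: "1", Type: "fmt:bold", Data: "<b class='x'>"}},
        {Text: &TextRun{Text: "here"}},
        {PcClose: &PcCloseRun{ID: "1", Data: "</b>"}},
        {Text: &TextRun{Text: " for info"}},
    }
}

func main() {
    runs := example()
    fmt.Println(flatten(runs))
    fmt.Println(structural(runs))
    fmt.Println(renderWithData(runs))
}
```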

Higher-level consumers layer further projections on top of the same []Run: a PUA-coded form for the visual editor's styled chips (rendered on the TypeScript side), an <x id="1"/>… placeholder form for LLM prompts, and semantic HTML for commercial MT. The Block is identical; each consumer renders it differently.

Boundaries: structural canonical, projections at consumers

The neokapi inline-code model is structural-canonical. Run[] is the single source of truth for a Block's content. Every other representation that crosses a boundary — to a translator, an LLM, an MT provider, a CAT tool, a runtime, a TM index — is a projection computed from Run[] on demand.

This separation is deliberate:

  • Structural inside. Every internal pipeline component (filters, tools, store, editor, runtime resolvers) reads and writes Run[]. Type-rich, format-agnostic, lossless.
  • Textual at boundaries. Each external consumer gets a textual form purpose-built for it. Several projections coexist; each is tuned to the consumer's expectations and quality characteristics.

The framework provides:

| Projection | Surface | Consumer |
| --- | --- | --- |
| Run[] (no projection) | Block.Source / Targets, KLF wire | Pipeline tools, store, format readers/writers |
| RenderRunsWithData(runs) | native source markup | Format writers (HTML, Markdown, XLIFF fallback); replays Data verbatim |
| RunsStructuralText(runs) | Click {1}here{/1} for info | TM matching (structural tier); cross-format leverage |
| RunsGeneralizedText(runs) | structural + entity placeholders | TM matching (generalized tier) |
| RunsPlaceholderText(runs) | <x id="1"/>here<x id="/1"/> | LLM prompts where tag preservation is critical |
| RunsSemanticHTML(runs, reg) | <a href="…">here</a> | Commercial MT (DeepL, Google) and HTML-style LLM prompts |
| flattenRuns(runs) (TS) | Click {=m0}here{/=m0} | ICU runtime, kapi-react __tx re-attach |
| runsToCoded(runs) (TS) | PUA-marker text + SpanInfo[] | Visual editor (chips, formatting, semantic spans rendered as styled text) |

Two consequences fall out of the convention:

  1. No single "translator format." A user editing in the framework's visual editor sees nested chips with semantic formatting (<b> rendered bold); the same Block in an external CAT tool comes through as XLIFF <pc>; the same Block sent to an LLM goes as RunsPlaceholderText or RunsSemanticHTML. The structural Block is identical; each consumer renders it differently.
  2. Format extensions follow the same rule. A new format reader, a new extractor (e.g., kapi-react), a new translator surface — each emits Run[] and lets the framework's existing projections handle every consumer. New textual conventions are only introduced when an existing projection is genuinely insufficient.

Reader and writer contracts

Readers populate every field on each inline-code run they emit:

  1. ID — sequential numeric; a paired PcOpen/PcClose share the same ID.
  2. Type / SubType — from the semantic-type vocabulary plus a format-specific refinement.
  3. Data — verbatim native markup for roundtrip fidelity.
  4. Disp — short human-readable label ("[B]", "[IMG]").
  5. Equiv — plain-text equivalent where applicable.
  6. Constraints (Deletable, Cloneable, Reorderable), set according to format semantics.

Writers reconstruct output using Run.Data (the native markup), not the semantic type. This ensures perfect roundtrip fidelity — the writer replays exactly what the reader captured, which is what RenderRunsWithData does:

func RenderRunsWithData(runs []Run) string {
    var buf strings.Builder
    for _, r := range runs {
        switch {
        case r.Text != nil:
            buf.WriteString(r.Text.Text)
        case r.Ph != nil:
            buf.WriteString(r.Ph.Data)
        case r.PcOpen != nil:
            buf.WriteString(r.PcOpen.Data)
        case r.PcClose != nil:
            buf.WriteString(r.PcClose.Data)
        // …Sub replays Ref; Plural/Select recurse into the 'other' branch.
        }
    }
    return buf.String()
}

Layers and embedded content

Embedded content is modeled as nested Layers. A Layer carries its own DataFormat identifier and a ParentID linking it to the enclosing layer. When a format reader encounters embedded content (e.g., an HTML string inside a JSON value), it emits a child Layer with Format: "html" containing the parsed HTML Blocks, nested between the parent Layer's Parts:

PartLayerStart (format="json", id="doc1")
    PartBlock (key: "title", text: "Hello")
    PartLayerStart (format="html", id="sf1", parentID="doc1")
        PartBlock ("Welcome to <b>our site</b>")
    PartLayerEnd (id="sf1")
    PartData (structural JSON)
PartLayerEnd (id="doc1")

Each Layer is independently processable by format-aware tools. Layers nest recursively: HTML in JSON in YAML is three levels deep with no special cases.
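A consumer can recover each block's enclosing format from the stream alone by keeping a stack of open layers. In this sketch Part is flattened to three fields instead of carrying a Resource, the constant values are placeholders, and the block IDs "title" and "body" are invented for the demo.

```go
package main

import "fmt"

type PartType int

// Placeholder values; the real constants carry explicit wire values.
const (
    PartLayerStart PartType = iota + 1
    PartLayerEnd
    PartBlock
    PartData
)

// Part is a flattened stand-in: the real Part carries a Resource instead.
type Part struct {
    Type   PartType
    ID     string
    Format string // set on PartLayerStart only
}

// blockFormats maps each block ID to the format of its innermost layer.
func blockFormats(parts []Part) map[string]string {
    var stack []string // formats of the currently open layers
    out := map[string]string{}
    for _, p := range parts {
        switch p.Type {
        case PartLayerStart:
            stack = append(stack, p.Format)
        case PartLayerEnd:
            if len(stack) > 0 {
                stack = stack[:len(stack)-1]
            }
        case PartBlock:
            if len(stack) > 0 {
                out[p.ID] = stack[len(stack)-1]
            }
        }
    }
    return out
}

// demoParts mirrors the stream shown above: an HTML layer inside a JSON layer.
func demoParts() []Part {
    return []Part{
        {Type: PartLayerStart, ID: "doc1", Format: "json"},
        {Type: PartBlock, ID: "title"},
        {Type: PartLayerStart, ID: "sf1", Format: "html"},
        {Type: PartBlock, ID: "body"},
        {Type: PartLayerEnd, ID: "sf1"},
        {Type: PartData, ID: "skeleton"},
        {Type: PartLayerEnd, ID: "doc1"},
    }
}

func main() {
    fmt.Println(blockFormats(demoParts()))
}
```

The same loop handles any nesting depth, which is the "no special cases" property: a third level only pushes one more entry on the stack.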

SubfilterResolver

Format-to-format embedding is coordinated by a small interface:

type SubfilterResolver interface {
    ResolveReader(formatName string) (DataFormatReader, error)
    ResolveWriter(formatName string) (DataFormatWriter, error)
}

FormatRegistry implements this through its NewReader / NewWriter methods. The interface decouples format readers from the registry, prevents circular imports, and enables test mocks.

Readers and writers that support subfiltering implement a marker interface:

type SubfilterAware interface {
    SetSubfilterResolver(r SubfilterResolver)
}

The resolver is injected before Open / Write is called. Any registered format (native, plugin, or bridge) can serve as a subfilter.

Format configs declare subfilter mappings that bind content locations to a format reader:

subfilters:
  - pattern: "*.body"
    format: html
  - pattern: "*.description"
    format: markdown

Patterns use filepath.Match semantics with . as the path separator. JSON readers use key-path globs; XML readers use element-path patterns.
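One simple way to get Match-style globbing with "." as the separator is to rewrite dots to slashes and call the standard library's path.Match, where "*" never crosses a separator. This is an assumed approach, not necessarily how the readers implement it, and it would misbehave for keys that themselves contain "/" or ".". The subfilterRule type and both helpers are hypothetical.

```go
package main

import (
    "fmt"
    "path"
    "strings"
)

// subfilterRule mirrors one entry of the subfilters config shown above.
type subfilterRule struct {
    Pattern string
    Format  string
}

// matchKeyPath reports whether a dot-separated key path matches a pattern,
// by mapping "." to "/" and delegating to path.Match.
func matchKeyPath(pattern, keyPath string) (bool, error) {
    toSlash := func(s string) string { return strings.ReplaceAll(s, ".", "/") }
    return path.Match(toSlash(pattern), toSlash(keyPath))
}

// formatFor returns the format of the first rule whose pattern matches.
// Malformed patterns are treated as non-matching.
func formatFor(rules []subfilterRule, keyPath string) (string, bool) {
    for _, r := range rules {
        if ok, err := matchKeyPath(r.Pattern, keyPath); err == nil && ok {
            return r.Format, true
        }
    }
    return "", false
}

func main() {
    rules := []subfilterRule{
        {Pattern: "*.body", Format: "html"},
        {Pattern: "*.description", Format: "markdown"},
    }
    for _, key := range []string{"post.body", "item.description", "a.b.body", "title"} {
        format, ok := formatFor(rules, key)
        fmt.Printf("%s -> %q (matched: %v)\n", key, format, ok)
    }
}
```

Under these semantics "*.body" matches "post.body" but not "a.b.body", because "*" stands for exactly one path segment.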

When a reader encounters content matching a subfilter pattern, it emits PartLayerStart, delegates extraction to the sub-reader resolved via the SubfilterResolver, and emits PartLayerEnd when the sub-reader finishes. Writers buffer all parts between matching Layer boundaries, delegate to the sub-writer, and insert the rendered string into the parent format.

Integration with AI, MT, and TM

AI tools and MT providers pick the appropriate Run projection based on the backend's tag-handling capability:

  • Commercial MT APIs (DeepL, Google, Amazon) — use RunsSemanticHTML. The API preserves the semantic HTML tags; the framework restores the native markup from each run's original Data.
  • LLM translation — use RunsPlaceholderText or RunsSemanticHTML depending on prompt strategy. The response is parsed back into a []Run by matching placeholder tags to the source runs.
  • TM matching — three-tier matching uses FlattenRuns, RunsStructuralText, and RunsGeneralizedText in order. Because structural keys use run IDs and not native markup, TM entries created from HTML match Markdown at the structural tier.
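The round trip through an LLM can be sketched as follows: the translated string comes back with <x id="…"/> tags, and each tag is matched by ID to the source run so that the original Data is restored. The tag syntax follows the RunsPlaceholderText example in the projections table; the regular expression, the parsePlaceholderText and render helpers, and the choice to silently drop unknown IDs are assumptions for the sketch.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

type TextRun struct{ Text string }
type PcOpenRun struct{ ID, Data string }
type PcCloseRun struct{ ID, Data string }

type Run struct {
    Text    *TextRun
    PcOpen  *PcOpenRun
    PcClose *PcCloseRun
}

// tagRE matches <x id="1"/> (opening) and <x id="/1"/> (closing).
var tagRE = regexp.MustCompile(`<x id="(/?)(\d+)"/>`)

// parsePlaceholderText rebuilds runs from a translated string, reusing the
// source's inline-code runs by ID. Tags with unknown IDs are dropped here;
// a real tool would more likely report them as a QA finding.
func parsePlaceholderText(s string, source []Run) []Run {
    opens := map[string]*PcOpenRun{}
    closes := map[string]*PcCloseRun{}
    for _, r := range source {
        if r.PcOpen != nil {
            opens[r.PcOpen.ID] = r.PcOpen
        }
        if r.PcClose != nil {
            closes[r.PcClose.ID] = r.PcClose
        }
    }
    var out []Run
    pos := 0
    for _, m := range tagRE.FindAllStringSubmatchIndex(s, -1) {
        if m[0] > pos { // text between the previous tag and this one
            out = append(out, Run{Text: &TextRun{Text: s[pos:m[0]]}})
        }
        id := s[m[4]:m[5]]
        if m[3] > m[2] { // the "/" group is non-empty: a closing tag
            if c, ok := closes[id]; ok {
                out = append(out, Run{PcClose: c})
            }
        } else if o, ok := opens[id]; ok {
            out = append(out, Run{PcOpen: o})
        }
        pos = m[1]
    }
    if pos < len(s) {
        out = append(out, Run{Text: &TextRun{Text: s[pos:]}})
    }
    return out
}

// render replays native markup, as a format writer would.
func render(runs []Run) string {
    var sb strings.Builder
    for _, r := range runs {
        switch {
        case r.Text != nil:
            sb.WriteString(r.Text.Text)
        case r.PcOpen != nil:
            sb.WriteString(r.PcOpen.Data)
        case r.PcClose != nil:
            sb.WriteString(r.PcClose.Data)
        }
    }
    return sb.String()
}

// sourceRuns is the source side of the example: Click <b class='x'>here</b> for info.
func sourceRuns() []Run {
    return []Run{
        {Text: &TextRun{Text: "Click "}},
        {PcOpen: &PcOpenRun{ID: "1", Data: "<b class='x'>"}},
        {Text: &TextRun{Text: "here"}},
        {PcClose: &PcCloseRun{ID: "1", Data: "</b>"}},
        {Text: &TextRun{Text: " for info"}},
    }
}

func main() {
    reply := `Cliquez <x id="1"/>ici<x id="/1"/> pour plus d'infos`
    fmt.Println(render(parsePlaceholderText(reply, sourceRuns())))
}
```

The model never sees the native markup, so the class attribute on the bold tag survives even though the prompt carried only a numbered tag.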

Consequences

  • Type dispatch via switch part.Type replaces instanceof-style checks; because the Go compiler does not enforce exhaustive switches, linters supply that check statically.
  • Adding new resource types requires only a new PartType constant and a struct implementing Resource.
  • Tools that only handle Blocks ignore all other Part types via the BaseTool pass-through behavior (AD-006: Tool System).
  • The Part stream remains a single ordered channel; no fan-out complexity in the base pipeline.
  • Content-addressable identity enables incremental extraction and deduplication across documents.
  • Dynamic properties and annotations let tools and connectors carry metadata without content-model changes.
  • The semantic-type abstraction lets TM match across formats and lets AI prompts receive consistent inline-code representations.
  • Writers replay Run.Data verbatim, so roundtrip fidelity is a property of the model, not of each format's implementation.
  • Layers nest recursively with no special cases — embedded content is a first-class pipeline citizen.
  • The Run union (paired PcOpen/PcClose, self-closing Ph, structured Plural/Select) aligns with XLIFF 2.0's <pc>/<ph> model, making XLIFF serialization a natural mapping rather than a lossy conversion.
  • Segmentation, terminology, entities, and QA share one run-anchored stand-off overlay mechanism; segmentation never mutates content, so it is opt-in, multi-layered, and losslessly reversible.
  • Bilingual interchange formats that carry sentence segments (XLIFF 2.0 <segment>/<ignorable>, TMX) project to and from segmentation + alignment overlays at the reader/writer boundary — opt-in byte-faithful round-trip — without forcing segment structure into the content model.
  • Targets are first-class records keyed by a variant (locale plus optional tone/channel); committed translations carry status and provenance, candidate proposals stay as alt-translation annotations, and the variant axis extends beyond locale without ceremony at locale-only call sites.