Content preparation
Most of the value in a localization pipeline is decided before the first word is translated. Is the source clean and settled? Is it segmented into the units a translator or model should work on? Are the terms and named entities recognized, so they can be enforced, protected, or reused? Has the content been checked?
neokapi treats this as one content-preparation pass: a sequence of stages that annotate the source without destroying it. Every stage adds a run-anchored stand-off overlay to the same canonical block — the source runs are written once, settled, and then read by everything downstream.
This page is the map; each stage links to its own concept page.
1. Settle the source
Some operations rewrite the source itself — redaction replacing sensitive spans with placeholders, a normalizer, a simplifier. These transformers are ordinary ordered steps: the framework applier performs each rewrite inline and in order, rebasing any surviving run-anchored overlays (segments, term and entity spans) onto the new runs, so each transformer settles the source before later steps observe it. Placing a transformer early keeps that rebasing to a minimum — the flow's placement pass warns when one sits later than its earliest valid slot and rejects unsafe orderings outright.
2. Segment
Segmentation marks the boundaries — usually sentences — that translation and TM key on. It is an overlay, not a split, so a block can carry a sentence layer for translation alongside a coarser chunk layer for an LLM, and the unsegmented block is always recoverable. Choose a rule-based engine (SRX, the localization standard), a Unicode baseline (UAX-29), an LLM for semantic chunks, or the SaT ML model for text that rules segment poorly.
3. Recognize the named things
Two overlays capture the named things in the source — and both exist for the outcome they enable, not as ends in themselves:
- Terminology —
term-lookupmatches the project termbase against the source and attaches the concept, its preferred translations, and its status. This is both a translation resource (a glossary that feeds AI translation) and the basis for enforcement. - Entity detection — people, organizations, products, locations, dates and more are recognized automatically (a fast local model, an LLM, or both). You never run this as its own task: it is the detection that powers redaction (protect sensitive spans) and entity-generalized translation-memory reuse (match across every value of a name). Detection skips terms the termbase already covers, so the two passes complement rather than duplicate each other.
4. Check
Checks are tests for content: deterministic verifiers that
read the source (and, after translation, the target) and report
findings without modifying anything. In the preparation
pass, source-side checks catch problems early — doubled words, suspicious
patterns, off-vocabulary brand terms — and the same engine runs the bilingual
checks (placeholder integrity, do-not-translate survival, terminology
enforcement) after translation. Run as a gate, kapi check exits non-zero so CI
or an assistant's fix-loop acts on the findings.
One settled model, many readers
The point of doing all of this as overlays on one block is that every downstream reader sees the same canonical source:
- Translation memory matches on the segment layer and can generalize over entity spans.
- AI translation and MT translate per segment, with the matched terminology injected as guidance and do-not-translate entities protected.
- Checks point findings at the exact run range that broke — a sentence, a term, a placeholder.
Nothing is re-parsed or re-derived between stages, and removing any overlay returns the block to its prior state.
Putting it in a flow
The preparation pass is an ordinary flow: one ordered list of steps — a transformer to settle the model, then annotation steps, then translation, then a check gate.
steps:
- tool: redact # settle the source first (optional)
- tool: segmentation # sentence boundaries
config: { engine: srx }
- tool: term-lookup # match the termbase
- tool: ai-entity-extract # recognize entities
- tool: tm-leverage # reuse prior segment translations
- tool: ai-translate # translate the remainder
- tool: qa-check # gate on findings
In a .kapi project this lives as a named flow so
every run prepares content the same way and the overlays feed the project-local
TM and termbase. For a runnable, step-by-step version, see the
Prepare content for translation recipe.