Translation Memory

neokapi's translation memory is Sievepen (sievepen/). Unlike traditional TMs that store plain strings, Sievepen works with the full content model — each entry holds multilingual variants as Run sequences (text plus inline markup) and matches them in three tiers with entity-aware adaptation. The same engine backs the kapi tm commands, the tm-leverage pipeline tool, and the Go library.

Content-aware matching

Each entry is indexed under three keys, tried in order, so the highest-quality match is returned first:

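| Tier | Match type  | Normalizes                          | Example                                |
|------|-------------|-------------------------------------|----------------------------------------|
| 1    | Generalized | Named entities → typed placeholders | "Welcome, John" → "Welcome, {PERSON}"  |
| 2    | Structural  | Inline markup → normalized codes    | "Click here" → "Click {1}here{/1}"     |
| 3    | Plain       | Nothing (raw text)                  | Levenshtein fuzzy matching             |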

Each tier yields exact (100%) or fuzzy matches. When a generalized exact match is found, entity values from the current source are adapted into the stored target: a stored pair "Welcome, Bob" → "Bienvenue, Bob" answers the new source "Welcome, Alice" with "Bienvenue, Alice" at 100%. The tier order mirrors how a translator evaluates matches: entity differences matter less than structural ones, which matter less than textual changes.
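The generalized tier and its entity adaptation can be pictured with a small standalone sketch. It does not use Sievepen's API: the Entity type and the generalize and adapt functions are hypothetical names chosen for illustration, plain strings stand in for Run sequences, and entities are paired by type only, which is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity is a hypothetical detected entity: its surface text and its type.
type Entity struct {
	Text string // e.g. "Bob"
	Type string // e.g. "PERSON"
}

// generalize replaces each entity's surface text with a typed placeholder,
// producing the kind of key a generalized tier could index and look up.
func generalize(text string, entities []Entity) string {
	for _, e := range entities {
		text = strings.Replace(text, e.Text, "{"+e.Type+"}", 1)
	}
	return text
}

// adapt rewrites a stored target for a new source by swapping each stored
// entity value for the first current entity of the same type.
// Assumption: pairing by type alone is enough for this example.
func adapt(storedTarget string, stored, current []Entity) string {
	for _, s := range stored {
		for _, c := range current {
			if c.Type == s.Type {
				storedTarget = strings.Replace(storedTarget, s.Text, c.Text, 1)
				break
			}
		}
	}
	return storedTarget
}

func main() {
	stored := []Entity{{Text: "Bob", Type: "PERSON"}}
	current := []Entity{{Text: "Alice", Type: "PERSON"}}

	// Both sources generalize to "Welcome, {PERSON}", so the keys are equal.
	fmt.Println(generalize("Welcome, Bob", stored) == generalize("Welcome, Alice", current)) // true

	// The stored French target is adapted to the new entity value.
	fmt.Println(adapt("Bienvenue, Bob", stored, current)) // Bienvenue, Alice
}
```

A production implementation would track entity positions rather than search for the surface text, which is what the position-carrying adaptations described for the Go library suggest.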

The typed placeholders the generalized tier keys on ({PERSON}, {PRODUCT}, …) come from entity detection — a fast local model or an LLM that recognizes the named things in a block. You don't run detection as a separate task: it happens as part of preparing content, and the same detection also powers redaction. Annotate entities once and both generalized TM reuse and redaction follow.

Storage backends

Two backends ship in the sievepen/ package, both implementing the TranslationMemory interface with full tier support:

  1. In-memory (sievepen.NewInMemoryTM) — fast and ephemeral, used for session-scoped batch processing.
  2. SQLite (sievepen.NewSQLiteTM) — persistent file-based storage for CLI workflows.

The interface also accommodates server-side backends for multi-user deployments with project scoping, streams, and workspace isolation. Fuzzy matching uses Levenshtein edit distance with a configurable threshold (default 0.70); results are sorted by score and then by tier.
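As a rough illustration of threshold-based fuzzy scoring, the sketch below computes a Levenshtein distance, turns it into a 0.0-1.0 similarity, drops candidates under the threshold, and sorts by score and then tier. The normalization (one minus distance over the longer length) and all names here are assumptions for the example; the source does not state the exact formula Sievepen uses.

```go
package main

import (
	"fmt"
	"sort"
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// similarity maps the distance to 0.0-1.0 (assumed normalization).
func similarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

// Candidate is a hypothetical scored match; Tier 1 is the generalized tier.
type Candidate struct {
	Text  string
	Score float64
	Tier  int
}

// rank keeps candidates at or above minScore, best score first, then lowest tier.
func rank(cands []Candidate, minScore float64) []Candidate {
	var kept []Candidate
	for _, c := range cands {
		if c.Score >= minScore {
			kept = append(kept, c)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		if kept[i].Score != kept[j].Score {
			return kept[i].Score > kept[j].Score
		}
		return kept[i].Tier < kept[j].Tier
	})
	return kept
}

func main() {
	query := "Welcome to our platform"
	stored := "Welcome to the platform"
	cands := []Candidate{{Text: stored, Score: similarity(query, stored), Tier: 3}}
	for _, c := range rank(cands, 0.70) {
		fmt.Printf("%.2f tier=%d %q\n", c.Score, c.Tier, c.Text)
	}
}
```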

CLI usage

Resource location

All TM commands (except list) accept these mutually exclusive flags:

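| Flag          | Resolves to                 | Example                  |
|---------------|-----------------------------|--------------------------|
| --name <n>    | ~/.config/kapi/tm/<n>.db    | --name project-tm        |
| --local       | ./tm.db (current directory) | --local                  |
| --file <path> | Explicit file path          | --file /shared/memory.db |
| (no flag)     | Same as --local             |                          |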

Databases are created on demand if they don't exist.
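The resolution rules in the table can be sketched as a single function. This is an illustration of the documented behavior, not kapi's actual code: resolveTMPath and its parameters are hypothetical, and the home directory is passed in so the function stays easy to test.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// resolveTMPath maps the --name, --local and --file flags to a database path.
// An empty name or file means the flag was not given.
func resolveTMPath(home, name string, local bool, file string) (string, error) {
	set := 0
	if name != "" {
		set++
	}
	if local {
		set++
	}
	if file != "" {
		set++
	}
	if set > 1 {
		return "", errors.New("--name, --local and --file are mutually exclusive")
	}
	switch {
	case name != "":
		return filepath.Join(home, ".config", "kapi", "tm", name+".db"), nil
	case file != "":
		return file, nil
	default:
		// --local, or no flag at all: tm.db relative to the current directory.
		return "tm.db", nil
	}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		panic(err)
	}
	path, err := resolveTMPath(home, "project-tm", false, "")
	if err != nil {
		panic(err)
	}
	fmt.Println(path)
}
```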

kapi tm import translations.tmx --name project-tm -s en -t fr
kapi tm export --name project-tm -s en -t fr -o output.tmx
kapi tm lookup "Welcome to our platform" --name project-tm -s en -t fr
kapi tm search "welcome" --name project-tm -s en
kapi tm stats --name project-tm
kapi tm list

Pipeline integration

The tm-leverage tool queries the TM for each Block's source segments and applies matches. Every match — exact or fuzzy — is recorded as an AltTranslation annotation (matched source/target runs, score, match type, tm origin), and a filled target is committed with provenance (Origin{Kind: "tm", Tool: "tm-leverage"}), its score, and draft status, so the leverage is auditable rather than an opaque overwrite. Exact matches skip AI translation, reducing cost and latency.

Segment-aware leverage. When a block carries a multi-segment segmentation overlay (a prose paragraph split into sentences), tm-leverage looks up the TM per sentence. This recovers leverage for multi-sentence blocks that would never match the sentence-keyed TM as a single unit. A single-segment block (most software-localization strings) takes the whole-block path unchanged.

The result is recorded so it is auditable, not blind:

  • Each matching sentence is attached as an AltTranslation annotation (matched source and target runs, score, exact/fuzzy match type, tm origin) — kept whether or not the block target is filled, so partial leverage (some sentences matched, some new) is preserved for a reviewer or a later translation stage rather than discarded.
  • The block records tm-segment-matches (e.g. 3/5) for quick gating.
  • The block target is filled only when every sentence matched and the segments are contiguous; when it is, the committed target carries provenance (Origin{Kind: "tm", Tool: "tm-leverage"}), the roll-up Score, and draft status — a reviewable pre-fill, not a signed-off translation.

Run segmentation before tm-leverage to enable this.
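The roll-up rule described above (count the matched sentences, fill the target only when all matched and the segments are contiguous) can be sketched as follows. The Segment type and summarize function are hypothetical, and "contiguous" is assumed here to mean that each segment starts where the previous one ends; the source does not define the term more precisely.

```go
package main

import "fmt"

// Segment is a hypothetical per-sentence lookup result.
type Segment struct {
	Start, End int  // offsets of the sentence within the block
	Matched    bool // true when the TM returned an acceptable match
}

// summarize returns the "matched/total" string recorded on the block and
// whether the block target may be filled from the per-sentence matches.
func summarize(segs []Segment) (string, bool) {
	matched := 0
	contiguous := true
	for i, s := range segs {
		if s.Matched {
			matched++
		}
		if i > 0 && s.Start != segs[i-1].End {
			contiguous = false
		}
	}
	fill := len(segs) > 0 && matched == len(segs) && contiguous
	return fmt.Sprintf("%d/%d", matched, len(segs)), fill
}

func main() {
	partial := []Segment{
		{0, 10, true}, {10, 20, true}, {20, 30, false}, {30, 40, true}, {40, 50, false},
	}
	summary, fill := summarize(partial)
	fmt.Println(summary, fill) // 3/5 false
}
```

With partial leverage the target stays empty, but the per-sentence matches would still be attached as annotations for a reviewer or a later stage.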

kapi tm-leverage -i input.html -o output.html --source-lang en --target-lang fr --tm project-tm

steps:
  - tool: tm-leverage
    config:
      fuzzyThreshold: 70       # minimum score for fuzzy matches (0-100)
      fillTarget: true         # copy the best candidate into the target
      fillTargetThreshold: 95  # minimum score required to fill the target
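One plausible reading of how the three settings interact is sketched below: candidates under fuzzyThreshold are dropped, the rest are kept as annotations, and the best one is copied into the target only if fillTarget is on and its score reaches fillTargetThreshold. The Config, Candidate and apply names are hypothetical, and the logic is inferred from the config comments rather than taken from the tool's source.

```go
package main

import "fmt"

// Config mirrors the three documented settings (scores on a 0-100 scale).
type Config struct {
	FuzzyThreshold      int
	FillTarget          bool
	FillTargetThreshold int
}

// Candidate is a hypothetical TM match with its proposed target text.
type Candidate struct {
	Target string
	Score  int
}

// apply returns the candidates worth keeping as annotations and the text to
// fill into the target ("" when nothing qualifies).
func apply(cfg Config, cands []Candidate) ([]Candidate, string) {
	var kept []Candidate
	best := -1 // index into kept of the highest-scoring candidate
	for _, c := range cands {
		if c.Score < cfg.FuzzyThreshold {
			continue
		}
		kept = append(kept, c)
		if best < 0 || c.Score > kept[best].Score {
			best = len(kept) - 1
		}
	}
	if cfg.FillTarget && best >= 0 && kept[best].Score >= cfg.FillTargetThreshold {
		return kept, kept[best].Target
	}
	return kept, ""
}

func main() {
	cfg := Config{FuzzyThreshold: 70, FillTarget: true, FillTargetThreshold: 95}
	kept, fill := apply(cfg, []Candidate{
		{Target: "Bienvenue sur notre plateforme", Score: 100},
		{Target: "Bienvenue sur la plateforme", Score: 80},
		{Target: "Bienvenue", Score: 60},
	})
	fmt.Println(len(kept), fill)
}
```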

Go library

Interface

type TranslationMemory interface {
	Add(entry TMEntry) error
	Lookup(source *model.Block, sourceLocale, targetLocale model.LocaleID,
		opts LookupOptions) ([]TMMatch, error)
	LookupText(source string, sourceLocale, targetLocale model.LocaleID,
		opts LookupOptions) ([]TMMatch, error)
	Delete(id string) error
	Count() int
	Close() error
}

Lookup takes a full *model.Block and uses its Run content (and entity annotations) for tiered matching; LookupText takes a plain string and performs plain-tier matching only. LookupSegment matches a single segment of a block for sentence-level leverage. Both SQLite and in-memory backends also implement EntryProvider (Entries() and paginated SearchEntries(...)) for export and browsing.

Key types

type TMEntry struct {
	ID          string
	ProjectID   string
	Variants    map[model.LocaleID][]model.Run // peer language variants
	HintSrcLang model.LocaleID                 // locale the author treated as canonical
	Entities    []EntityMapping                // entity placeholders
	Properties  map[string]string
	Origins     []Origin
	Note        string
	CreatedAt   time.Time
	UpdatedAt   time.Time
}

type TMMatch struct {
	Entry             TMEntry
	Score             float64            // 0.0-1.0
	MatchType         MatchType
	ProjectID         string             // provenance of the matched entry
	EntityAdaptations []EntityAdaptation // entity value substitutions
}

type LookupOptions struct {
	MinScore     float64      // minimum match score (default 0.7)
	MaxResults   int          // max results to return (default 10)
	MatchModes   []MatchMode  // which tiers to use (default: all)
	ProjectID    string       // project context for scoring boost
	ProjectScope ProjectScope // project filtering mode (default: all)
}

An entry is multilingual: there is no authoritative source at the persistence layer — each language is a peer Variants[locale] Run sequence, and the lookup direction is supplied at the call site. MatchType ranges from generalized-exact (highest reuse) through structural-exact, exact, the corresponding fuzzy variants, down to fuzzy. TMEntry helpers: Variant(locale), VariantText(locale), VariantStructural(locale), VariantGeneralized(locale). The EntityAdaptations field on a match lists each substitution with its position so consumers can apply adaptations precisely.
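The idea that direction belongs to the call rather than to the stored entry can be shown with a reduced model. This is a standalone sketch, not the Sievepen types: Entry and translate are hypothetical, plain strings replace Run sequences, and only exact matching is shown.

```go
package main

import "fmt"

// Entry is a simplified stand-in for a TM entry: every locale is a peer.
type Entry struct {
	Variants map[string]string // locale -> text
}

// translate returns the tgt variant when the src variant equals text.
// The same entry serves any direction; src and tgt are chosen by the caller.
func translate(e Entry, text, src, tgt string) (string, bool) {
	s, okSrc := e.Variants[src]
	t, okTgt := e.Variants[tgt]
	if !okSrc || !okTgt || s != text {
		return "", false
	}
	return t, true
}

func main() {
	e := Entry{Variants: map[string]string{
		"en": "Welcome to our platform",
		"fr": "Bienvenue sur notre plateforme",
	}}

	fr, _ := translate(e, "Welcome to our platform", "en", "fr")
	en, _ := translate(e, "Bienvenue sur notre plateforme", "fr", "en")
	fmt.Println(fr)
	fmt.Println(en)
}
```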

Example

package main

import (
	"fmt"

	"github.com/neokapi/neokapi/core/model"
	"github.com/neokapi/neokapi/sievepen"
)

func main() {
	tm := sievepen.NewInMemoryTM()
	defer tm.Close()

	// Store one entry with English and French as peer variants.
	err := tm.Add(sievepen.TMEntry{
		ID: "e1",
		Variants: map[model.LocaleID][]model.Run{
			"en": {{Text: &model.TextRun{Text: "Welcome to our platform"}}},
			"fr": {{Text: &model.TextRun{Text: "Bienvenue sur notre plateforme"}}},
		},
		HintSrcLang: "en",
	})
	if err != nil {
		panic(err)
	}

	// Look the same sentence up in the en → fr direction.
	block := model.NewBlock("b1", "Welcome to our platform")
	matches, err := tm.Lookup(block, "en", "fr", sievepen.DefaultLookupOptions())
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		fmt.Printf("Score: %.0f%% Type: %s Target: %s\n",
			m.Score*100, m.MatchType, m.Entry.VariantText("fr"))
	}
}

TMX import / export

count, err := sievepen.ImportTMX(tm, reader, "en", "fr")
err = sievepen.ExportTMXBilingual(tm, writer, "en", "fr") // src/tgt pair
// or, for all locales in the TM:
err = sievepen.ExportTMX(tm, writer, []model.LocaleID{"en", "fr", "de"})

Translation memory and terminology

TM and terminology are deliberately separate systems with different data shapes: TM stores segments with their language variants, and terminology stores multi-locale concepts. They share the Block annotation system as their integration point, so both kinds of match are available to any downstream tool or editor.