Translation Memory

neokapi's translation memory is Sievepen (sievepen/). Unlike traditional TMs that store plain strings, Sievepen works with the full content model — each entry holds multilingual variants as Run sequences (text plus inline markup) and matches them in three tiers with entity-aware adaptation. The same engine backs the kapi tm commands, the tm-leverage pipeline tool, and the Go library.

Content-aware matching

Each entry is indexed under three keys, tried in order, so the highest-quality match is returned first:

Tier	Match type	Normalizes	Example
1	Generalized	Named entities → typed placeholders	"Welcome, John" → "Welcome, {PERSON}"
2	Structural	Inline markup → normalized codes	"Click here" → "Click {1}here{/1}"
3	Plain	Nothing (raw text)	Levenshtein fuzzy matching

Each tier yields exact (100%) or fuzzy matches. When a generalized exact match is found, entity values from the current source are adapted into the stored target: a stored pair "Welcome, Bob" → "Bienvenue, Bob" answers the new source "Welcome, Alice" with "Bienvenue, Alice" at 100%. This ordering mirrors how a translator evaluates matches: entity differences matter less than structural ones, which matter less than textual changes.
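The generalized tier and its adaptation step can be pictured with a small standalone sketch. It is not Sievepen's implementation: Entity, generalize, and adapt are hypothetical names, entities are assumed to be already detected, and plain string replacement stands in for the Run-based content model.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity is a hypothetical detected entity: a type label plus its surface text.
type Entity struct {
	Type  string // e.g. "PERSON"
	Value string // e.g. "Bob"
}

// generalize replaces the first occurrence of each entity value
// with its typed placeholder, e.g. "Bob" becomes "{PERSON}".
func generalize(text string, ents []Entity) string {
	for _, e := range ents {
		text = strings.Replace(text, e.Value, "{"+e.Type+"}", 1)
	}
	return text
}

// adapt fills the typed placeholders of a generalized target
// with the entity values found in the current source.
func adapt(generalizedTarget string, ents []Entity) string {
	for _, e := range ents {
		generalizedTarget = strings.Replace(generalizedTarget, "{"+e.Type+"}", e.Value, 1)
	}
	return generalizedTarget
}

func main() {
	// Store one entry under its generalized source key.
	storedEnts := []Entity{{Type: "PERSON", Value: "Bob"}}
	index := map[string]string{
		generalize("Welcome, Bob", storedEnts): generalize("Bienvenue, Bob", storedEnts),
	}

	// Look up a new source that differs only in the entity value.
	current := []Entity{{Type: "PERSON", Value: "Alice"}}
	if target, ok := index[generalize("Welcome, Alice", current)]; ok {
		fmt.Println(adapt(target, current))
	}
}
```

Because both sources generalize to the same key, the lookup is an exact hit even though the raw strings differ.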

The typed placeholders the generalized tier keys on ({PERSON}, {PRODUCT}, …) come from entity detection — a fast local model or an LLM that recognizes the named things in a block. You don't run detection as a separate task: it happens as part of preparing content, and the same detection also powers redaction. Annotate entities once and both generalized TM reuse and redaction follow.

Storage backends

Two backends ship in the sievepen/ package, both implementing the TranslationMemory interface with full tier support:

In-memory (sievepen.NewInMemoryTM) — fast and ephemeral, used for session-scoped batch processing.
SQLite (sievepen.NewSQLiteTM) — persistent file-based storage for CLI workflows.

The interface also accommodates server-side backends for multi-user deployments with project scoping, streams, and workspace isolation. Fuzzy matching uses Levenshtein edit distance with a configurable threshold (default 0.70); results are sorted by score and then by tier.
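The thresholding and ordering described above might look like the following sketch. Two things here are assumptions rather than documented behavior: the score is normalized as 1 - distance / max(length), and "sorted by score and then by tier" is read as highest score first, then lowest tier number. The names levenshtein, similarity, Candidate, Result, and rank are illustrative and not part of the sievepen package.

```go
package main

import (
	"fmt"
	"sort"
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// similarity maps edit distance to a 0.0-1.0 score (assumed normalization).
func similarity(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

// Candidate is a stored source text and the tier it was indexed under.
type Candidate struct {
	Text string
	Tier int
}

// Result is a candidate that passed the threshold, with its score.
type Result struct {
	Candidate
	Score float64
}

// rank drops candidates below threshold, then sorts by score (descending)
// and breaks ties by tier (ascending).
func rank(query string, cands []Candidate, threshold float64) []Result {
	var out []Result
	for _, c := range cands {
		if s := similarity(query, c.Text); s >= threshold {
			out = append(out, Result{Candidate: c, Score: s})
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Tier < out[j].Tier
	})
	return out
}

func main() {
	cands := []Candidate{
		{Text: "Welcome to the platform", Tier: 3},
		{Text: "Welcome to our platform", Tier: 3},
		{Text: "Goodbye", Tier: 3},
	}
	for _, r := range rank("Welcome to our platform", cands, 0.70) {
		fmt.Printf("%.2f tier %d %q\n", r.Score, r.Tier, r.Text)
	}
}
```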

CLI usage

Resource location

All TM commands (except list) accept these mutually exclusive flags:

Flag	Resolves to	Example
`--name <n>`	`~/.config/kapi/tm/<n>.db`	`--name project-tm`
`--local`	`./tm.db` (current directory)	`--local`
`--file <path>`	Explicit file path	`--file /shared/memory.db`
(no flag)	Same as `--local`

Databases are created on demand if they don't exist.

kapi tm import translations.tmx --name project-tm -s en -t fr
kapi tm export --name project-tm -s en -t fr -o output.tmx
kapi tm lookup "Welcome to our platform" --name project-tm -s en -t fr
kapi tm search "welcome" --name project-tm -s en
kapi tm stats --name project-tm
kapi tm list

Pipeline integration

The tm-leverage tool queries the TM for each Block's source segments and applies matches. Every match — exact or fuzzy — is recorded as an AltTranslation annotation (matched source/target runs, score, match type, tm origin), and a filled target is committed with provenance (Origin{Kind: "tm", Tool: "tm-leverage"}), its score, and draft status, so the leverage is auditable rather than an opaque overwrite. Exact matches skip AI translation, reducing cost and latency.

Segment-aware leverage. When a block carries a multi-segment segmentation overlay (a prose paragraph split into sentences), tm-leverage looks up the TM per sentence. This recovers leverage for multi-sentence blocks that would never match the sentence-keyed TM as a single unit. A single-segment block (most software-localization strings) takes the whole-block path unchanged.

The result is recorded so it is auditable, not blind:

Each matching sentence is attached as an AltTranslation annotation (matched source and target runs, score, exact/fuzzy match type, tm origin) — kept whether or not the block target is filled, so partial leverage (some sentences matched, some new) is preserved for a reviewer or a later translation stage rather than discarded.
The block records tm-segment-matches (e.g. 3/5) for quick gating.
The block target is filled only when every sentence matched and the segments are contiguous; when it is, the committed target carries provenance (Origin{Kind: "tm", Tool: "tm-leverage"}), the roll-up Score, and draft status — a reviewable pre-fill, not a signed-off translation.

Run segmentation before tm-leverage to enable this.
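The per-sentence roll-up can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: SegmentResult and Rollup are hypothetical types, contiguity is modeled as each segment starting where the previous one ends, and the roll-up score is taken as the lowest sentence score, which the text above does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// SegmentResult is the hypothetical outcome of one per-sentence lookup.
type SegmentResult struct {
	Start, End int     // offsets of the sentence within the block
	Matched    bool    // whether the TM returned a usable match
	Target     string  // matched target text, if any
	Score      float64 // 0.0-1.0
}

// Rollup summarizes a block after all of its sentences were looked up.
type Rollup struct {
	Matches string  // e.g. "3/5", as in tm-segment-matches
	Fill    bool    // true only when the block target may be filled
	Target  string  // joined sentence targets when Fill is true
	Score   float64 // assumed roll-up: lowest sentence score
}

// rollUp fills the target only if every sentence matched and the
// segments are contiguous; the match count is always recorded.
func rollUp(segs []SegmentResult) Rollup {
	matched := 0
	contiguous := true
	minScore := 1.0
	var b strings.Builder
	for i, s := range segs {
		if i > 0 && s.Start != segs[i-1].End {
			contiguous = false
		}
		if s.Matched {
			matched++
			if s.Score < minScore {
				minScore = s.Score
			}
			b.WriteString(s.Target)
		}
	}
	r := Rollup{Matches: fmt.Sprintf("%d/%d", matched, len(segs))}
	if len(segs) > 0 && matched == len(segs) && contiguous {
		r.Fill, r.Target, r.Score = true, b.String(), minScore
	}
	return r
}

func main() {
	segs := []SegmentResult{
		{Start: 0, End: 7, Matched: true, Target: "Salut. ", Score: 1.0},
		{Start: 7, End: 20, Matched: false},
	}
	fmt.Printf("%+v\n", rollUp(segs))
}
```

Partial leverage still produces a Matches value, which is what makes quick gating possible even when the target stays empty.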

kapi tm-leverage -i input.html -o output.html --source-lang en --target-lang fr --tm project-tm

steps:
  - tool: tm-leverage
    config:
      fuzzyThreshold: 70 # minimum score for fuzzy matches (0-100)
      fillTarget: true # copy the best candidate into the target
      fillTargetThreshold: 95 # minimum score required to fill the target
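One way to read the three settings together is sketched below: fuzzyThreshold decides which candidates are kept at all, and fillTargetThreshold decides whether the best kept candidate is copied into the target. The Config, Match, and leverage names are hypothetical, and the interaction is inferred from the config comments rather than taken from the tool's source.

```go
package main

import "fmt"

// Config mirrors the three YAML keys, with scores on the 0-100 scale.
type Config struct {
	FuzzyThreshold      int
	FillTarget          bool
	FillTargetThreshold int
}

// Match is a hypothetical TM candidate with a 0-100 score.
type Match struct {
	Target string
	Score  int
}

// leverage keeps every candidate at or above FuzzyThreshold and fills the
// target with the best one only if FillTarget is set and the best score
// reaches FillTargetThreshold.
func leverage(cfg Config, matches []Match) (kept []Match, target string, filled bool) {
	best := -1
	for _, m := range matches {
		if m.Score < cfg.FuzzyThreshold {
			continue
		}
		kept = append(kept, m)
		if best < 0 || m.Score > kept[best].Score {
			best = len(kept) - 1
		}
	}
	if cfg.FillTarget && best >= 0 && kept[best].Score >= cfg.FillTargetThreshold {
		return kept, kept[best].Target, true
	}
	return kept, "", false
}

func main() {
	cfg := Config{FuzzyThreshold: 70, FillTarget: true, FillTargetThreshold: 95}
	kept, target, filled := leverage(cfg, []Match{
		{Target: "Bienvenue", Score: 82},
		{Target: "Bonjour", Score: 60},
	})
	fmt.Println(len(kept), target, filled)
}
```

With these values, an 82% match is kept as a suggestion but the target stays empty, so a later translation stage still handles the block.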

Go library

Interface

type TranslationMemory interface {
    Add(entry TMEntry) error
    Lookup(source *model.Block, sourceLocale, targetLocale model.LocaleID,
        opts LookupOptions) ([]TMMatch, error)
    LookupText(source string, sourceLocale, targetLocale model.LocaleID,
        opts LookupOptions) ([]TMMatch, error)
    Delete(id string) error
    Count() int
    Close() error
}

Lookup takes a full *model.Block and uses its Run content (and entity annotations) for tiered matching; LookupText takes a plain string and performs plain-tier matching only. LookupSegment matches a single segment of a block for sentence-level leverage. Both SQLite and in-memory backends also implement EntryProvider (Entries() and paginated SearchEntries(...)) for export and browsing.

Key types

type TMEntry struct {
    ID          string
    ProjectID   string
    Variants    map[model.LocaleID][]model.Run // peer language variants
    HintSrcLang model.LocaleID                 // locale the author treated as canonical
    Entities    []EntityMapping                // entity placeholders
    Properties  map[string]string
    Origins     []Origin
    Note        string
    CreatedAt   time.Time
    UpdatedAt   time.Time
}

type TMMatch struct {
    Entry             TMEntry
    Score             float64 // 0.0-1.0
    MatchType         MatchType
    ProjectID         string             // provenance of the matched entry
    EntityAdaptations []EntityAdaptation // entity value substitutions
}

type LookupOptions struct {
    MinScore     float64      // minimum match score (default 0.7)
    MaxResults   int          // max results to return (default 10)
    MatchModes   []MatchMode  // which tiers to use (default: all)
    ProjectID    string       // project context for scoring boost
    ProjectScope ProjectScope // project filtering mode (default: all)
}

An entry is multilingual: there is no authoritative source at the persistence layer — each language is a peer Variants[locale] Run sequence, and the lookup direction is supplied at the call site. MatchType ranges from generalized-exact (highest reuse) through structural-exact, exact, the corresponding fuzzy variants, down to fuzzy. TMEntry helpers: Variant(locale), VariantText(locale), VariantStructural(locale), VariantGeneralized(locale). The EntityAdaptations field on a match lists each substitution with its position so consumers can apply adaptations precisely.

Example

package main

import (
    "fmt"

    "github.com/neokapi/neokapi/core/model"
    "github.com/neokapi/neokapi/sievepen"
)

func main() {
    tm := sievepen.NewInMemoryTM()
    defer tm.Close()

    // Add returns an error; handle it instead of discarding it.
    if err := tm.Add(sievepen.TMEntry{
        ID: "e1",
        Variants: map[model.LocaleID][]model.Run{
            "en": {{Text: &model.TextRun{Text: "Welcome to our platform"}}},
            "fr": {{Text: &model.TextRun{Text: "Bienvenue sur notre plateforme"}}},
        },
        HintSrcLang: "en",
    }); err != nil {
        panic(err)
    }

    block := model.NewBlock("b1", "Welcome to our platform")
    matches, err := tm.Lookup(block, "en", "fr", sievepen.DefaultLookupOptions())
    if err != nil {
        panic(err)
    }
    for _, m := range matches {
        fmt.Printf("Score: %.0f%% Type: %s Target: %s\n",
            m.Score*100, m.MatchType, m.Entry.VariantText("fr"))
    }
}

TMX import / export

count, err := sievepen.ImportTMX(tm, reader, "en", "fr")
err = sievepen.ExportTMXBilingual(tm, writer, "en", "fr") // src/tgt pair
// or, for all locales in the TM:
err = sievepen.ExportTMX(tm, writer, []model.LocaleID{"en", "fr", "de"})