
AD-010: Terminology

Summary

neokapi's terminology system is concept-oriented: a Concept groups terms across locales with per-term metadata (status, part of speech, grammatical gender). The TermBase interface (termbase/ package) supports in-memory and SQLite backends, a tiered lookup pipeline, and TBX import/export. Terminology flows through the streaming pipeline via first-class annotation types whose positions are run-anchored (RunRange) for precise inline highlighting that survives run-preserving edits.

Context

Terminology management in localization ranges from simple glossaries (CSV with source/target pairs) to concept-oriented termbases (TBX, MultiTerm). A flat glossary cannot express that "bug", "defect", and "issue" are terms for the same concept in different contexts, nor that "bug" can be preferred in engineering docs and deprecated in customer-facing content.

The framework needs:

  • Progressive complexity — start from a CSV glossary, grow into concept management without rewriting data.
  • Pipeline integration — terminology as streaming tools, not a separate service.
  • Precise positions — run-anchored RunRange spans on matched terms (run index + intra-run rune offset) so downstream UIs can highlight within a Fragment.
  • Annotation semantics — do-not-translate markers for entity names, locale formatting hints, and pending AI-proposed candidates distinct from curated entries.

TBX (ISO 30042:2019) is the universal interchange format for concept-oriented terminological data. Native storage uses SQLite for speed and query flexibility; TBX handles import and export only.

Decision

Concept-oriented data model

A Concept groups terms across locales, each with context:

// TermSource indicates whether a concept comes from traditional
// terminology or brand vocabulary.
type TermSource string

const (
	TermSourceTerminology     TermSource = "terminology"
	TermSourceBrandVocabulary TermSource = "brand_vocabulary"
)

type Term struct {
	Text           string
	Locale         model.LocaleID
	Status         model.TermStatus // proposed, approved, preferred, admitted, deprecated, forbidden
	PartOfSpeech   string
	Gender         string
	Note           string
	CompetitorTerm bool // marks a competitor brand term
}

type Concept struct {
	ID         string
	ProjectID  string
	Domain     string
	Definition string
	Source     TermSource // "terminology" or "brand_vocabulary"
	Terms      []Term
	Properties map[string]string
	CreatedAt  time.Time
	UpdatedAt  time.Time
}

Progressive disclosure: CSV import auto-creates Concepts with a single preferred Term per locale. No extra complexity is imposed on users who want a flat glossary.
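The flat-glossary path described above can be sketched roughly as follows. This is an illustration only, not the framework's ImportCSV: it assumes a CSV whose header row names the locales, and the helper conceptsFromCSV, the "csv-N" ID scheme, and the trimmed Term and Concept types are all hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Trimmed, local stand-ins for the Term and Concept types above.
type Term struct {
	Text   string
	Locale string
	Status string
}

type Concept struct {
	ID    string
	Terms []Term
}

// conceptsFromCSV is a hypothetical helper: the header row names the
// locales, and each later row becomes one Concept holding a single
// preferred Term per non-empty cell.
func conceptsFromCSV(data string) ([]Concept, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, nil
	}
	locales := rows[0]
	var out []Concept
	for i, row := range rows[1:] {
		c := Concept{ID: fmt.Sprintf("csv-%d", i+1)} // assumed ID scheme
		for col, text := range row {
			text = strings.TrimSpace(text)
			if text == "" {
				continue
			}
			c.Terms = append(c.Terms, Term{Text: text, Locale: locales[col], Status: "preferred"})
		}
		out = append(out, c)
	}
	return out, nil
}

func main() {
	concepts, err := conceptsFromCSV("en,de\nbug,Fehler\nrelease,Version\n")
	if err != nil {
		panic(err)
	}
	for _, c := range concepts {
		fmt.Println(c.ID, c.Terms)
	}
}
```

Because every row already becomes a Concept, richer metadata (status changes, extra terms per locale) can be layered on later without reshaping the stored data.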

TermBase interface

type TermBase interface {
	AddConcept(concept Concept) error
	GetConcept(id string) (Concept, bool)
	DeleteConcept(id string) error
	Lookup(sourceText string, opts LookupOptions) []TermMatch
	LookupAll(sourceText string, opts LookupOptions) []TermMatch
	Search(query string, sourceLocale, targetLocale model.LocaleID,
		offset, limit int) ([]Concept, int)
	Count() int
	Concepts() []Concept
	Close() error
}

Import and export are standalone functions rather than interface methods: ImportTBX, ExportTBX, ImportCSV, ExportCSV, ImportJSON, ExportJSON.
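To show how a backend slots in behind the interface, here is a minimal map-backed sketch covering only the CRUD subset of the methods. It is not the code in termbase/memory.go: conceptStore, memoryStore, the overwrite-on-duplicate behavior, and the error messages are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Concept is a trimmed stand-in for the Concept type above.
type Concept struct {
	ID     string
	Domain string
}

// conceptStore is a hypothetical, cut-down subset of TermBase.
type conceptStore interface {
	AddConcept(concept Concept) error
	GetConcept(id string) (Concept, bool)
	DeleteConcept(id string) error
	Count() int
}

// memoryStore keeps concepts in a map keyed by ID.
type memoryStore struct {
	concepts map[string]Concept
}

func newMemoryStore() *memoryStore {
	return &memoryStore{concepts: make(map[string]Concept)}
}

func (m *memoryStore) AddConcept(c Concept) error {
	if c.ID == "" {
		return errors.New("concept ID is required")
	}
	m.concepts[c.ID] = c // assumption: re-adding an ID overwrites the entry
	return nil
}

func (m *memoryStore) GetConcept(id string) (Concept, bool) {
	c, ok := m.concepts[id]
	return c, ok
}

func (m *memoryStore) DeleteConcept(id string) error {
	if _, ok := m.concepts[id]; !ok {
		return errors.New("concept not found: " + id)
	}
	delete(m.concepts, id)
	return nil
}

func (m *memoryStore) Count() int { return len(m.concepts) }

func main() {
	var store conceptStore = newMemoryStore()
	if err := store.AddConcept(Concept{ID: "c1", Domain: "software"}); err != nil {
		panic(err)
	}
	c, ok := store.GetConcept("c1")
	fmt.Println(c.Domain, ok, store.Count())
}
```

Callers that depend only on the interface can swap this kind of store for a persistent one without code changes, which is the property the backends below rely on.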

Backends

  • In-memory (termbase/memory.go) — fast, ephemeral; session-scoped batch processing.
  • SQLite (termbase/sqlite.go) — persistent file-based storage for CLI tools. Pure Go via modernc.org/sqlite.

A PostgreSQL backend with workspace isolation and terminology streams can be supplied by a platform layer behind the same TermBase interface.

Tiered lookup

Term lookup follows a cascading pipeline:

  1. Exact — case-sensitive match on normalized term text.
  2. Normalized — Unicode NFC + case folding + whitespace collapse.
  3. Fuzzy — trigram candidate retrieval + Levenshtein scoring on the ~200 closest candidates.
  4. AI-assisted (opt-in) — LLM proposes candidate term mappings that produce TermCandidateAnnotation entries for human review.

The fuzzy tier uses the same SQLite FTS5 trigram tokenizer as Sievepen (AD-009: Translation Memory), keeping lookup cost sub-linear in termbase size. Text is normalized with Unicode NFC via NormalizeTerm() before comparison. Character-level Levenshtein (on []rune) is correct for all scripts including CJK.

Which tiers run is selected per call through LookupOptions.MatchModes ([]model.MatchStrategy) on TermBase.Lookup/LookupAll, alongside CaseSensitive, MinScore, and scope filters — so a caller can request, for example, exact-only or exact-plus-fuzzy without changing the pipeline.
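The cascade can be illustrated with a small standard-library sketch. Several simplifications are assumptions of this example, not properties of the framework: lower-casing stands in for case folding, the NFC step is omitted because it needs golang.org/x/text, the fuzzy tier scans every term instead of using trigram candidate retrieval, and the score formula (1 minus distance over the longer length) is one common choice. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lower-cases and collapses whitespace. NFC normalization is
// deliberately left out to stay within the standard library.
func normalize(s string) string {
	return strings.ToLower(strings.Join(strings.Fields(s), " "))
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein computes edit distance over runes, so a multi-byte
// character (for example CJK) counts as a single edit.
func levenshtein(a, b []rune) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// similarity maps edit distance to a 0..1 score (assumed formula).
func similarity(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(ra, rb))/float64(longest)
}

// lookup tries exact, then normalized, then fuzzy, and reports the tier.
func lookup(terms []string, query string, minScore float64) (term, tier string, ok bool) {
	for _, t := range terms {
		if t == query {
			return t, "exact", true
		}
	}
	nq := normalize(query)
	for _, t := range terms {
		if normalize(t) == nq {
			return t, "normalized", true
		}
	}
	best, bestScore := "", 0.0
	for _, t := range terms {
		if s := similarity(normalize(t), nq); s > bestScore {
			best, bestScore = t, s
		}
	}
	if best != "" && bestScore >= minScore {
		return best, "fuzzy", true
	}
	return "", "", false
}

func main() {
	terms := []string{"Translation Memory", "termbase"}
	for _, q := range []string{"termbase", "translation  memory", "termbse", "glossary"} {
		term, tier, ok := lookup(terms, q, 0.8)
		fmt.Printf("%q -> %q (%s, found=%v)\n", q, term, tier, ok)
	}
}
```

Stopping at the first tier that matches keeps the cheap cases cheap; the expensive edit-distance pass only runs when the earlier tiers find nothing.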

Distinct from lookup, the Search method powers the termbase browser in the CLI and desktop UI. It uses an FTS5 tokenizer to support substring search ranked by match quality, rather than unranked LIKE '%...%' queries.

Annotations

Three annotation types — TermAnnotation, TermCandidateAnnotation, and EntityAnnotation — implement the Annotation interface with run-anchored RunRange positions for precise inline highlighting. (The termbase lookup itself returns a character-level TextRange offset into the source text, which the pipeline tool converts to a RunRange when it writes the annotation onto the block.)

  • TermAnnotation — a matched term from the termbase, carrying concept ID, target term options, status, and position.
  • TermCandidateAnnotation — AI-proposed term not yet in the termbase. Carries a status: proposed marker so UI reviewers can accept, reject, or defer.

The third type, EntityAnnotation, carries named entities (people, organizations, products, dates, locations) with run-anchored RunRange positions and optional DNT (do-not-translate) flags. Entity annotations serve multiple purposes:

  • Input to Sievepen TM generalization (AD-009: Translation Memory).
  • Do-not-translate markers consumed by AI translation.
  • Locale formatting hints (dates, numbers) for downstream tools.
  • Terminology candidate discovery.

These annotations join AltTranslation as first-class annotations on Blocks.
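The conversion from a character-level TextRange to a run-anchored RunRange, mentioned at the start of this section, might look like the sketch below. The field layout of RunRange and RunPos, the half-open rune offsets, and the function toRunRange are assumptions for illustration; the source only states that a position is a run index plus an intra-run rune offset.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// RunPos is a hypothetical position: a run index plus a rune offset
// inside that run.
type RunPos struct {
	Run    int
	Offset int
}

// RunRange is a hypothetical half-open span between two positions.
type RunRange struct {
	Start RunPos
	End   RunPos
}

// toRunRange maps the half-open rune range [start, end) over the
// concatenated run text onto run-relative positions. It returns false
// when the range is empty or falls outside the text.
func toRunRange(runs []string, start, end int) (RunRange, bool) {
	if start < 0 || end <= start {
		return RunRange{}, false
	}
	var rr RunRange
	foundStart := false
	base := 0 // rune offset at which the current run begins
	for i, run := range runs {
		n := utf8.RuneCountInString(run)
		if !foundStart && start < base+n {
			rr.Start = RunPos{Run: i, Offset: start - base}
			foundStart = true
		}
		if foundStart && end <= base+n {
			rr.End = RunPos{Run: i, Offset: end - base}
			return rr, true
		}
		base += n
	}
	return RunRange{}, false
}

func main() {
	runs := []string{"Fix the ", "bug", " in release"}
	rr, ok := toRunRange(runs, 8, 11) // the word "bug"
	fmt.Printf("%+v %v\n", rr, ok)
}
```

Anchoring to a run rather than to an absolute offset means that an edit confined to another run leaves this position valid, which is the property the annotations depend on.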

Concept relations

The termbase persists typed, directed ConceptRelation edges between concepts. Each edge has an ID, a source and target concept, a type drawn from the SKOS-aligned vocabulary, an optional note, and an optional validity:

  • broader / narrower — taxonomic relationships (skos:broader / skos:narrower).
  • part-of / has-part — compositional meronymy/holonymy.
  • related — the associative relationship (skos:related).
  • replaced-by — a superseded concept points to its replacement.
  • use-instead — a discouraged term points at the preferred one.
  • exact-match / close-match — cross-scheme equivalence (skos:exactMatch / skos:closeMatch).
  • competitor — a competitor's term.

KnownRelationType and ValidateRelation gate writes: a relation is rejected unless its type is in the vocabulary and both concepts exist. The interface exposes AddRelation, DeleteRelation, RelationsOf (both directions), and ListRelations; the read methods take an optional *graph.Scope and return only edges whose validity matches. Relations enable graph navigation in UIs and deprecation workflows where a superseded concept's terms are flagged in new content; the term-enforce tool resolves use-instead / replaced-by to name the replacement.
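The write gate can be sketched as below. The relation type strings come from the list above; everything else (the lower-case function names, the exists callback, the struct layout, and the error text) is a stand-in chosen for this example and not the framework's actual signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// ConceptRelation is a trimmed, hypothetical edge type.
type ConceptRelation struct {
	ID       string
	SourceID string
	TargetID string
	Type     string
	Note     string
}

// knownRelationTypes mirrors the vocabulary listed in this section.
var knownRelationTypes = map[string]bool{
	"broader": true, "narrower": true,
	"part-of": true, "has-part": true,
	"related":     true,
	"replaced-by": true, "use-instead": true,
	"exact-match": true, "close-match": true,
	"competitor": true,
}

// validateRelation rejects an edge unless its type is in the vocabulary
// and both endpoints exist. The exists callback is an assumption that
// stands in for a termbase lookup.
func validateRelation(r ConceptRelation, exists func(id string) bool) error {
	if !knownRelationTypes[r.Type] {
		return fmt.Errorf("unknown relation type %q", r.Type)
	}
	if !exists(r.SourceID) || !exists(r.TargetID) {
		return errors.New("both source and target concepts must exist")
	}
	return nil
}

func main() {
	concepts := map[string]bool{"c-old": true, "c-new": true}
	exists := func(id string) bool { return concepts[id] }

	ok := ConceptRelation{ID: "r1", SourceID: "c-old", TargetID: "c-new", Type: "replaced-by"}
	bad := ConceptRelation{ID: "r2", SourceID: "c-old", TargetID: "c-new", Type: "synonym"}
	fmt.Println(validateRelation(ok, exists))
	fmt.Println(validateRelation(bad, exists))
}
```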

Term and relation validity

A term and a relation each carry an optional *graph.Validity — a half-open [valid-from, valid-to) interval plus free-form tags. LookupOptions.Scope and the relation read methods accept a *graph.Scope (a point in time plus tags) and return only the terms and edges active at that scope. This is how the termbase answers as-of-time and within-a-tag-scope (for example, per-market) questions; the framework assigns tags no meaning, leaving the vocabulary to the caller.

Status transitions

ValidateTransition(from, to) accepts any transition between known statuses, and IsGovernedTransition(from, to) flags the consequential ones — any transition to forbidden or preferred, or from forbidden. The framework classifies transitions; it does not impose a review workflow, leaving that to a platform built on it.
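The classification rule stated above is small enough to sketch directly. The status names come from the Term type; the local TermStatus type, the lower-case function names, and the error text are stand-ins for this example.

```go
package main

import "fmt"

// TermStatus is a local stand-in for model.TermStatus.
type TermStatus string

const (
	StatusProposed   TermStatus = "proposed"
	StatusApproved   TermStatus = "approved"
	StatusPreferred  TermStatus = "preferred"
	StatusAdmitted   TermStatus = "admitted"
	StatusDeprecated TermStatus = "deprecated"
	StatusForbidden  TermStatus = "forbidden"
)

var knownStatuses = map[TermStatus]bool{
	StatusProposed: true, StatusApproved: true, StatusPreferred: true,
	StatusAdmitted: true, StatusDeprecated: true, StatusForbidden: true,
}

// validateTransition accepts any move between known statuses.
func validateTransition(from, to TermStatus) error {
	if !knownStatuses[from] || !knownStatuses[to] {
		return fmt.Errorf("unknown status in transition %q -> %q", from, to)
	}
	return nil
}

// isGovernedTransition flags the consequential moves: anything that
// ends at forbidden or preferred, or that starts from forbidden.
func isGovernedTransition(from, to TermStatus) bool {
	return to == StatusForbidden || to == StatusPreferred || from == StatusForbidden
}

func main() {
	fmt.Println(isGovernedTransition(StatusProposed, StatusApproved))
	fmt.Println(isGovernedTransition(StatusApproved, StatusPreferred))
	fmt.Println(isGovernedTransition(StatusForbidden, StatusDeprecated))
}
```

A platform layer could use the second function to decide which changes require a reviewer, while the framework itself stays workflow-neutral.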

Competitor terms

Terms carry a CompetitorTerm boolean flag marking competitor brand terms. The brand-vocab-check tool surfaces competitor terms found in source text as critical-severity brand-voice findings (and forbidden terms as major-severity), supporting brand voice governance using the termbase's brand-vocabulary source.
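The severity mapping can be sketched as below. Only the mapping itself (competitor terms are critical, forbidden terms are major) is taken from the text; the Finding type, checkBrandVocab, and the naive case-insensitive substring matching are assumptions and do not describe how brand-vocab-check actually locates terms.

```go
package main

import (
	"fmt"
	"strings"
)

// Term is a trimmed stand-in for the Term type above.
type Term struct {
	Text           string
	Status         string
	CompetitorTerm bool
}

// Finding is a hypothetical result record.
type Finding struct {
	Term     string
	Severity string
}

// checkBrandVocab scans source text for competitor and forbidden terms.
// Matching is a plain substring test, which is an oversimplification.
func checkBrandVocab(source string, terms []Term) []Finding {
	lower := strings.ToLower(source)
	var out []Finding
	for _, t := range terms {
		if t.Text == "" || !strings.Contains(lower, strings.ToLower(t.Text)) {
			continue
		}
		switch {
		case t.CompetitorTerm:
			out = append(out, Finding{Term: t.Text, Severity: "critical"})
		case t.Status == "forbidden":
			out = append(out, Finding{Term: t.Text, Severity: "major"})
		}
	}
	return out
}

func main() {
	terms := []Term{
		{Text: "Acme Cloud", CompetitorTerm: true},
		{Text: "blacklist", Status: "forbidden"},
		{Text: "allowlist", Status: "preferred"},
	}
	fmt.Println(checkBrandVocab("Migrate from Acme Cloud and update the blacklist.", terms))
}
```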

Pipeline tools

The framework ships built-in terminology tools as ordinary pipeline stages:

  • term-lookup (enrich) — scans source text for known terms, attaches TermAnnotation with run-anchored RunRange positions. Downstream tools (AI translate, QA) use these annotations for context.
  • term-enforce (validate) — for each known source term, checks that an acceptable target-locale translation (preferred/approved by default, configurable via CheckStatuses) is present in the target text; flags blocks where the expected translation is missing. Forbidden-, deprecated-, and competitor-term detection is handled by brand-vocab-check, which scans source text — not term-enforce.
  • ai-terminology (AI-assisted enrich) — LLM extraction of candidate terms with status: proposed. Uses a provider from AD-011: AI Providers.
  • ai-entity-extract (AI-assisted enrich) — LLM-based named entity annotation (with optional NER). Should run early in the pipeline, before tm-leverage.
  • redact and unredact (transform) — pair that replaces entity values with typed placeholders before external services and restores them afterwards.
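The term-enforce check described in the list above can be approximated with the sketch below. The default statuses (preferred, approved) follow the text; the Entry and Target types, the missingTerms function, and the substring matching are hypothetical simplifications, since the real tool works from annotations and not from raw string search.

```go
package main

import (
	"fmt"
	"strings"
)

// Target is a hypothetical target-locale term with its status.
type Target struct {
	Text   string
	Status string
}

// Entry pairs a source term with its candidate translations.
type Entry struct {
	Source  string
	Targets []Target
}

// missingTerms returns the source terms that occur in source but have no
// acceptable translation in target. With no statuses given it checks
// preferred and approved, mirroring the default described above.
func missingTerms(source, target string, entries []Entry, checkStatuses ...string) []string {
	if len(checkStatuses) == 0 {
		checkStatuses = []string{"preferred", "approved"}
	}
	allowed := make(map[string]bool, len(checkStatuses))
	for _, s := range checkStatuses {
		allowed[s] = true
	}
	src, tgt := strings.ToLower(source), strings.ToLower(target)
	var missing []string
	for _, e := range entries {
		if !strings.Contains(src, strings.ToLower(e.Source)) {
			continue // source term not present, nothing to enforce
		}
		found := false
		for _, t := range e.Targets {
			if allowed[t.Status] && strings.Contains(tgt, strings.ToLower(t.Text)) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, e.Source)
		}
	}
	return missing
}

func main() {
	entries := []Entry{
		{Source: "termbase", Targets: []Target{{Text: "Termbank", Status: "preferred"}}},
		{Source: "pipeline", Targets: []Target{{Text: "Pipeline", Status: "approved"}}},
	}
	fmt.Println(missingTerms("Open the termbase in the pipeline.",
		"Öffnen Sie das Glossar in der Pipeline.", entries))
}
```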

A full pipeline looks like:

Source --chan--> ai-entity-extract --chan--> term-lookup --chan--> tm-leverage --chan--> ai-translate --chan--> term-enforce --chan--> Sink
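Stages connected by channels, as in the pipeline above, can be sketched with plain Go channels. This is a generic illustration of the pattern, not the framework's stage API: Block, Stage, stage, and run are hypothetical, and the two example stages only append marker strings where the real tools would attach annotations.

```go
package main

import (
	"fmt"
	"strings"
)

// Block is a toy stand-in for the pipeline's block type.
type Block struct {
	Source      string
	Annotations []string
}

// Stage consumes blocks from one channel and emits them on another.
type Stage func(in <-chan Block) <-chan Block

// stage wraps a per-block function as a streaming Stage.
func stage(fn func(*Block)) Stage {
	return func(in <-chan Block) <-chan Block {
		out := make(chan Block)
		go func() {
			defer close(out)
			for b := range in {
				fn(&b)
				out <- b
			}
		}()
		return out
	}
}

// run feeds blocks through the stages in order and collects the output.
func run(blocks []Block, stages ...Stage) []Block {
	src := make(chan Block)
	go func() {
		defer close(src)
		for _, b := range blocks {
			src <- b
		}
	}()
	var ch <-chan Block = src
	for _, s := range stages {
		ch = s(ch)
	}
	var out []Block
	for b := range ch {
		out = append(out, b)
	}
	return out
}

func main() {
	termLookup := stage(func(b *Block) {
		if strings.Contains(b.Source, "bug") {
			b.Annotations = append(b.Annotations, "term:bug")
		}
	})
	translate := stage(func(b *Block) {
		b.Annotations = append(b.Annotations, "translated")
	})
	for _, b := range run([]Block{{Source: "Fix the bug"}, {Source: "Hello"}}, termLookup, translate) {
		fmt.Println(b.Source, b.Annotations)
	}
}
```

Each stage starts work as soon as the first block arrives, so later stages see annotations from earlier ones without the whole document being buffered.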

TBX import and export

TBX (ISO 30042:2019) is the interchange format. Import maps TBX entries to Concepts and populates per-locale Terms. Export preserves concept relations, term status, and context fields.
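As a rough picture of the import mapping, the sketch below reads a hand-written fragment that uses the conceptEntry / langSec / termSec nesting of TBX 2019 and groups terms per concept and locale. It is not the framework's ImportTBX: the struct names and parseTBX are hypothetical, the sample omits the header and namespace a real file would carry, and term status and other metadata are ignored.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// tbxDoc picks out only the concept entries from the document body.
type tbxDoc struct {
	Entries []conceptEntry `xml:"text>body>conceptEntry"`
}

type conceptEntry struct {
	ID       string    `xml:"id,attr"`
	LangSecs []langSec `xml:"langSec"`
}

type langSec struct {
	Lang  string   `xml:"lang,attr"` // matches the xml:lang attribute
	Terms []string `xml:"termSec>term"`
}

// parseTBX is a hypothetical helper that decodes the minimal structure.
func parseTBX(data string) ([]conceptEntry, error) {
	var doc tbxDoc
	if err := xml.Unmarshal([]byte(data), &doc); err != nil {
		return nil, err
	}
	return doc.Entries, nil
}

const sample = `<tbx type="TBX-Basic" style="dca" xml:lang="en">
  <text><body>
    <conceptEntry id="c1">
      <langSec xml:lang="en">
        <termSec><term>bug</term></termSec>
        <termSec><term>defect</term></termSec>
      </langSec>
      <langSec xml:lang="de">
        <termSec><term>Fehler</term></termSec>
      </langSec>
    </conceptEntry>
  </body></text>
</tbx>`

func main() {
	entries, err := parseTBX(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		for _, ls := range e.LangSecs {
			fmt.Println(e.ID, ls.Lang, ls.Terms)
		}
	}
}
```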

Consequences

  • Terminology is a first-class pipeline citizen, not a bolt-on post-processing step.
  • Run-anchored annotation positions enable precise inline UI highlighting without re-detecting term boundaries at render time.
  • Entity annotations drive both terminology extraction and TM generalization — a single annotation pass serves multiple consumers.
  • Concept relations give UIs a graph substrate for browsing terminology without requiring a separate graph database in the framework.
  • CompetitorTerm gives the framework a minimal hook for brand guardrails without depending on the full brand module.
  • The same storage backends as TM (in-memory, SQLite) keep the CLI dependency footprint small and cross-compilation simple.