AD-010: Terminology
Summary
neokapi's terminology system is concept-oriented: a Concept groups terms
across locales with per-term metadata (status, part of speech, grammatical
gender). The TermBase interface (termbase/ package) supports in-memory
and SQLite backends, a tiered lookup pipeline, and TBX import/export.
Terminology flows through the streaming pipeline via first-class annotation
types whose positions are run-anchored (RunRange) for precise inline
highlighting that survives run-preserving edits.
Context
Terminology management in localization ranges from simple glossaries (CSV with source/target pairs) to concept-oriented termbases (TBX, MultiTerm). A flat glossary does not express that "bug", "defect", and "issue" are terms for the same concept in different contexts, nor that "bug" can be preferred in engineering docs and deprecated in customer- facing content.
The framework needs:
- Progressive complexity — start from a CSV glossary, grow into concept management without rewriting data.
- Pipeline integration — terminology as streaming tools, not a separate service.
- Precise positions — run-anchored
RunRangespans on matched terms (run index + intra-run rune offset) so downstream UIs can highlight within a Fragment. - Annotation semantics — do-not-translate markers for entity names, locale formatting hints, and pending AI-proposed candidates distinct from curated entries.
TBX (ISO 30042:2019) is the universal interchange format for concept- oriented terminological data. Native storage uses SQLite for speed and query flexibility; TBX handles import and export only.
Decision
Concept-oriented data model
A Concept groups terms across locales, each with context:
// TermSource indicates whether a concept comes from traditional
// terminology or brand vocabulary.
type TermSource string
const (
TermSourceTerminology TermSource = "terminology"
TermSourceBrandVocabulary TermSource = "brand_vocabulary"
)
type Term struct {
Text string
Locale model.LocaleID
Status model.TermStatus // proposed, approved, preferred,
// admitted, deprecated, forbidden
PartOfSpeech string
Gender string
Note string
CompetitorTerm bool // marks a competitor brand term
}
type Concept struct {
ID string
ProjectID string
Domain string
Definition string
Source TermSource // "terminology" or "brand_vocabulary"
Terms []Term
Properties map[string]string
CreatedAt time.Time
UpdatedAt time.Time
}
Progressive disclosure: CSV import auto-creates Concepts with a single preferred Term per locale. No extra complexity is imposed on users who want a flat glossary.
TermBase interface
type TermBase interface {
AddConcept(concept Concept) error
GetConcept(id string) (Concept, bool)
DeleteConcept(id string) error
Lookup(sourceText string, opts LookupOptions) []TermMatch
LookupAll(sourceText string, opts LookupOptions) []TermMatch
Search(query string, sourceLocale, targetLocale model.LocaleID,
offset, limit int) ([]Concept, int)
Count() int
Concepts() []Concept
Close() error
}
Import and export are standalone functions rather than interface methods:
ImportTBX, ExportTBX, ImportCSV, ExportCSV, ImportJSON,
ExportJSON.
Backends
- In-memory (
termbase/memory.go) — fast, ephemeral; session-scoped batch processing. - SQLite (
termbase/sqlite.go) — persistent file-based storage for CLI tools. Pure Go viamodernc.org/sqlite.
A PostgreSQL backend with workspace isolation and terminology streams can be
supplied by a platform layer behind the same TermBase interface.
Tiered lookup
Term lookup follows a cascading pipeline:
- Exact — case-sensitive match on normalized term text.
- Normalized — Unicode NFC + case folding + whitespace collapse.
- Fuzzy — trigram candidate retrieval + Levenshtein scoring on the ~200 closest candidates.
- AI-assisted (opt-in) — LLM proposes candidate term mappings that
produce
TermCandidateAnnotationentries for human review.
The fuzzy tier uses the same SQLite FTS5 trigram tokenizer as Sievepen
(AD-009: Translation Memory), keeping lookup
cost sub-linear in termbase size. Text is normalized with Unicode NFC via
NormalizeTerm() before comparison. Character-level Levenshtein (on
[]rune) is correct for all scripts including CJK.
Which tiers run is selected per call through LookupOptions.MatchModes
([]model.MatchStrategy) on TermBase.Lookup/LookupAll, alongside
CaseSensitive, MinScore, and scope filters — so a caller can request, for
example, exact-only or exact-plus-fuzzy without changing the pipeline.
UI search
Distinct from lookup, the Search method powers the termbase browser in
the CLI and desktop UI. It uses an FTS5 tokenizer to support substring
search ranked by match quality, rather than unranked LIKE '%...%'
queries.
Annotations
Three annotation types — TermAnnotation, TermCandidateAnnotation, and
EntityAnnotation — implement the Annotation interface with run-anchored
RunRange positions for precise inline highlighting. (The termbase lookup
itself returns a character-level TextRange offset into the source text,
which the pipeline tool converts to a RunRange when it writes the
annotation onto the block.)
TermAnnotation— a matched term from the termbase, carrying concept ID, target term options, status, and position.TermCandidateAnnotation— AI-proposed term not yet in the termbase. Carries astatus: proposedmarker so UI reviewers can accept, reject, or defer.
An EntityAnnotation type carries named entities (people,
organizations, products, dates, locations) with run-anchored RunRange
positions and optional DNT (do-not-translate) flags. Entity annotations
serve multiple purposes:
- Input to Sievepen TM generalization (AD-009: Translation Memory).
- Do-not-translate markers consumed by AI translation.
- Locale formatting hints (dates, numbers) for downstream tools.
- Terminology candidate discovery.
These annotations join AltTranslation as first-class annotations on
Blocks.
Concept relations
The termbase persists typed, directed ConceptRelation edges between concepts.
Each edge has an ID, a source and target concept, a type drawn from the
SKOS-aligned vocabulary, an optional note, and an optional validity:
- broader / narrower — taxonomic relationships
(
skos:broader/skos:narrower). - part-of / has-part — compositional meronymy/holonymy.
- related — the associative relationship (
skos:related). - replaced-by — a superseded concept points to its replacement.
- use-instead — a discouraged term points at the preferred one.
- exact-match / close-match — cross-scheme equivalence
(
skos:exactMatch/skos:closeMatch). - competitor — a competitor's term.
KnownRelationType and ValidateRelation gate writes: a relation is rejected
unless its type is in the vocabulary and both concepts exist. The interface
exposes AddRelation, DeleteRelation, RelationsOf (both directions), and
ListRelations; the read methods take an optional *graph.Scope and return
only edges whose validity matches. Relations enable graph navigation in UIs and
deprecation workflows where a superseded concept's terms are flagged in new
content; the term-enforce tool resolves use-instead / replaced-by to name
the replacement.
Term and relation validity
A term and a relation each carry an optional *graph.Validity — a half-open
[valid-from, valid-to) interval plus free-form tags. LookupOptions.Scope
and the relation read methods accept a *graph.Scope (a point in time plus
tags) and return only the terms and edges active at that scope. This is how the
termbase answers as-of-time and within-a-tag-scope (for example, per-market)
questions; the framework assigns tags no meaning, leaving the vocabulary to the
caller.
Status transitions
ValidateTransition(from, to) accepts any transition between known statuses,
and IsGovernedTransition(from, to) flags the consequential ones — any
transition to forbidden or preferred, or from forbidden. The framework
classifies transitions; it does not impose a review workflow, leaving that to a
platform built on it.
Competitor terms
Terms carry a CompetitorTerm boolean flag marking competitor brand
terms. The brand-vocab-check tool surfaces competitor terms found in
source text as critical-severity brand-voice findings (and forbidden terms
as major-severity), supporting brand voice governance using the termbase's
brand-vocabulary source.
Pipeline tools
The framework ships built-in terminology tools as ordinary pipeline stages:
term-lookup(enrich) — scans source text for known terms, attachesTermAnnotationwith run-anchoredRunRangepositions. Downstream tools (AI translate, QA) use these annotations for context.term-enforce(validate) — for each known source term, checks that an acceptable target-locale translation (preferred/approved by default, configurable viaCheckStatuses) is present in the target text; flags blocks where the expected translation is missing. Forbidden-, deprecated-, and competitor-term detection is handled bybrand-vocab-check, which scans source text — notterm-enforce.ai-terminology(AI-assisted enrich) — LLM extraction of candidate terms withstatus: proposed. Uses a provider from AD-011: AI Providers.ai-entity-extract(AI-assisted enrich) — LLM-based named entity annotation (with optional NER). Should run early in the pipeline, beforetm-leverage.redactandunredact(transform) — pair that replaces entity values with typed placeholders before external services and restores them afterwards.
A full pipeline looks like:
TBX import and export
TBX (ISO 30042:2019) is the interchange format. Import maps TBX entries to Concepts and populates per-locale Terms. Export preserves concept relations, term status, and context fields.
Consequences
- Terminology is a first-class pipeline citizen, not a bolt-on post-processing step.
- Run-anchored annotation positions enable precise inline UI highlighting without re-detecting term boundaries at render time.
- Entity annotations drive both terminology extraction and TM generalization — a single annotation pass serves multiple consumers.
- Concept relations give UIs a graph substrate for browsing terminology without requiring a separate graph database in the framework.
CompetitorTermgives the framework a minimal hook for brand guardrails without depending on the full brand module.- The same storage backends as TM (in-memory, SQLite) keep the CLI dependency footprint small and cross-compilation simple.
Related
- AD-002: Content Model — annotations on Blocks
- AD-006: Tool System — pipeline tool pattern
- AD-009: Translation Memory — shared matching infrastructure, entity annotation input
- AD-011: AI Providers — LLM-based term extraction and entity annotation
- Terminology Data Model — full Go struct definitions, pipeline tool catalog, relations