Terminology Data Model
This note provides implementation details for AD-010.
Data Model: Concept-Oriented
The core data model is concept-oriented, following TBX principles. A Concept groups terms across languages, each with context dimensions:
type Term struct {
    Text           string           // the term text
    Locale         model.LocaleID   // language/locale
    Status         model.TermStatus // lifecycle status (proposed, approved, preferred,
                                    // admitted, deprecated, forbidden)
    PartOfSpeech   string           // noun, verb, adjective, etc.
    Gender         string           // grammatical gender (if applicable)
    Note           string           // usage note or context
    CompetitorTerm bool             // true if this is a competitor brand term
}
type Concept struct {
    ID         string            // unique concept identifier
    ProjectID  string            // project scope (empty = workspace-scoped)
    Domain     string            // subject field (software, medical, legal, etc.)
    Definition string            // language-neutral definition
    Source     TermSource        // "terminology" or "brand_vocabulary"
    Terms      []Term            // terms across locales
    Properties map[string]string // extensible metadata
    CreatedAt  time.Time
    UpdatedAt  time.Time
}
TermSource distinguishes traditional terminology
(TermSourceTerminology) from brand vocabulary (TermSourceBrandVocabulary),
so the two populations can share one termbase while staying filterable.
Progressive disclosure: CSV import auto-creates Concepts with a single preferred Term per locale -- no extra complexity required.
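A minimal sketch of how a flat CSV could be folded into Concepts with one preferred Term per locale. The three-column layout (concept ID, locale, term), the conceptsFromCSV helper, and the string-typed Locale and Status fields are assumptions made for illustration; this note does not specify the actual ImportCSV signature or column layout.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Simplified stand-ins for the Term and Concept types (assumption:
// plain strings instead of model.LocaleID and model.TermStatus).
type Term struct {
	Text   string
	Locale string
	Status string
}

type Concept struct {
	ID    string
	Terms []Term
}

// conceptsFromCSV groups rows of "concept_id,locale,term" into Concepts.
// The column layout is hypothetical. Each row becomes one preferred Term;
// a second row for the same concept and locale is skipped so that every
// locale keeps a single preferred Term.
func conceptsFromCSV(data string) ([]Concept, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	index := map[string]int{} // concept ID -> position in out
	seen := map[string]bool{} // concept/locale pairs already filled
	var out []Concept
	for _, row := range rows {
		if len(row) != 3 {
			return nil, fmt.Errorf("expected 3 columns, got %d", len(row))
		}
		id, locale, text := row[0], row[1], row[2]
		i, ok := index[id]
		if !ok {
			i = len(out)
			index[id] = i
			out = append(out, Concept{ID: id})
		}
		key := id + "\x00" + locale
		if seen[key] {
			continue
		}
		seen[key] = true
		out[i].Terms = append(out[i].Terms,
			Term{Text: text, Locale: locale, Status: "preferred"})
	}
	return out, nil
}

func main() {
	concepts, err := conceptsFromCSV("c1,en,file\nc1,de,Datei\nc2,en,folder\n")
	if err != nil {
		panic(err)
	}
	for _, c := range concepts {
		fmt.Println(c.ID, len(c.Terms))
	}
}
```

Skipping duplicate concept/locale rows is one possible policy; an importer could equally reject them or store them as admitted variants.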
TermBase Interface
type TermBase interface {
    AddConcept(concept Concept) error
    GetConcept(id string) (Concept, bool)
    DeleteConcept(id string) error
    Lookup(sourceText string, opts LookupOptions) []TermMatch
    LookupAll(sourceText string, opts LookupOptions) []TermMatch
    Search(query string, sourceLocale, targetLocale model.LocaleID, offset, limit int) ([]Concept, int)
    Count() int
    Concepts() []Concept
    Close() error
}
Import and export are standalone functions rather than interface methods:
ImportJSON/ExportJSON, ImportCSV/ExportCSV, and ImportTBX/ExportTBX
(the ISO TBX interchange format, with TBXImportOptions/TBXExportOptions).
Framework backends: in-memory (CLI batch) and SQLite (persistent). The
TermBase interface supports server-side backends for multi-user deployments.
Fuzzy Matching and Search
Term lookup uses a tiered matching pipeline: exact -> normalized -> fuzzy. Fuzzy matching uses trigram-based candidate retrieval to avoid full table scans:
- SQLite: FTS5 trigram tokenizer on the text_lower column, synced via triggers. Falls back to length-based pre-filtering if FTS5 is unavailable.
- PostgreSQL: pg_trgm GIN index on the text_lower column, using the % similarity operator. Falls back to length-based pre-filtering.
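Levenshtein scoring is then applied to roughly 200 trigram candidates. The distance is computed per character (on []rune, not on bytes), so a multi-byte character counts as a single edit. This keeps scoring meaningful for all scripts, including CJK, where each character typically corresponds to a morpheme.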
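The exact -> normalized -> fuzzy tiers and the rune-level scoring can be sketched as follows. This covers only the scoring step; candidate retrieval via trigrams is left to the backend. The match helper, its maxDist threshold, and the lowercase-and-trim normalize function (a stand-in for NormalizeTerm) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes
// rather than bytes, so a multi-byte character costs a single edit.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// normalize is a simplified stand-in for NormalizeTerm (assumption):
// it only trims whitespace and lowercases.
func normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// match reports which tier accepts the pair: "exact", "normalized",
// "fuzzy", or "" when no tier matches. maxDist is a hypothetical threshold.
func match(query, term string, maxDist int) string {
	if query == term {
		return "exact"
	}
	nq, nt := normalize(query), normalize(term)
	if nq == nt {
		return "normalized"
	}
	if levenshtein(nq, nt) <= maxDist {
		return "fuzzy"
	}
	return ""
}

func main() {
	fmt.Println(levenshtein("東京都", "京都")) // one rune deleted
	fmt.Println(match("Login", "login", 1))
	fmt.Println(match("login", "logon", 1))
}
```

A byte-level distance would charge three edits for removing a single three-byte CJK character, which is why the conversion to []rune matters.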
UI search uses ranked full-text search:
- SQLite: FTS5 trigram tokenizer for substring matching on term text.
- PostgreSQL: pg_trgm similarity() ranking on text_lower.
Text normalization applies Unicode NFC (golang.org/x/text/unicode/norm) via NormalizeTerm() before comparison, handling Arabic diacritics, Hangul jamo composition, and accented Latin characters.
Pipeline Tools
Two pipeline tools integrate terminology into the streaming pipeline (AD-006):
term-lookup (Enrich) -- Scans source text for known terms, attaches TermAnnotation with TextRange character positions. Downstream tools (AI translate, QA) use these annotations for context.
term-enforce (Validate) -- Checks preferred term usage in target text. Reports forbidden terms, non-preferred variants, deprecated terms, and missing target counterparts.
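A rough sketch of the kind of check term-enforce performs for a single concept. It assumes the target-locale terms of a concept matched in the source are already known, and it uses case-insensitive substring search as a simplification of the real matching pipeline. The Finding type, the enforce function, and the string-typed Status are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// Term is a simplified stand-in (assumption: string-typed status).
type Term struct {
	Text   string
	Status string
}

// Finding is a hypothetical result record for this sketch.
type Finding struct {
	Kind string // "forbidden", "deprecated", "non-preferred", or "missing"
	Term string // offending term text; empty for "missing"
}

// enforce checks the target text against the target-locale terms of one
// concept. Matching is a case-insensitive substring search, which is a
// simplification made for this example.
func enforce(target string, terms []Term) []Finding {
	lower := strings.ToLower(target)
	var findings []Finding
	found := false
	for _, t := range terms {
		if !strings.Contains(lower, strings.ToLower(t.Text)) {
			continue
		}
		found = true
		switch t.Status {
		case "preferred":
			// Correct usage: nothing to report.
		case "forbidden", "deprecated":
			findings = append(findings, Finding{Kind: t.Status, Term: t.Text})
		default:
			findings = append(findings, Finding{Kind: "non-preferred", Term: t.Text})
		}
	}
	if !found {
		// No term of the concept appears in the target at all.
		findings = append(findings, Finding{Kind: "missing"})
	}
	return findings
}

func main() {
	terms := []Term{
		{Text: "Ordner", Status: "preferred"},
		{Text: "Verzeichnis", Status: "admitted"},
		{Text: "Mappe", Status: "forbidden"},
	}
	fmt.Println(enforce("Öffnen Sie die Mappe", terms))
	fmt.Println(enforce("Öffnen Sie den Ordner", terms))
}
```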
Related AI and redaction tools (registered in core/ai/tools/ and
core/tools/):
ai-terminology (Enrich, AI) -- LLM extraction of candidate terms. Uses an AI provider from AD-011.
ai-entity-extract (Enrich, AI) -- Named entity annotation (people, organizations, products, dates, locations). Serves multiple purposes: TM generalization in Sievepen (AD-009), do-not-translate markers, localization hints, and terminology candidate discovery. Should run early in the pipeline -- before tm-leverage.
redact (Transform) -- Privacy tool replacing entity values with typed placeholders (e.g., "John" -> {PERSON}) before external services. See AD-020.
unredact (Transform) -- Restores original entity values after external processing. Paired with redact:
reader -> ai-entity-extract -> redact -> [external MT] -> unredact -> writer
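The redact/unredact pairing can be illustrated with a small round trip. This is a sketch under assumptions: the Entity type is a stand-in for the entity annotations, and the numbered placeholder form ({PERSON_1}, {PERSON_2}) is a choice made here so that several entities of the same type can be restored unambiguously; the note itself only shows the unnumbered {PERSON} form.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity is a simplified stand-in for an entity annotation (assumption).
type Entity struct {
	Type  string // e.g. "PERSON", "ORG"
	Value string // original text of the entity
}

// redact replaces each entity value with a typed, numbered placeholder and
// returns the redacted text plus the mapping needed to restore it.
// Overlapping entity values (one a substring of another) are not handled.
func redact(text string, entities []Entity) (string, map[string]string) {
	restore := map[string]string{}
	counts := map[string]int{}
	for _, e := range entities {
		if e.Value == "" || !strings.Contains(text, e.Value) {
			continue
		}
		counts[e.Type]++
		placeholder := fmt.Sprintf("{%s_%d}", e.Type, counts[e.Type])
		text = strings.ReplaceAll(text, e.Value, placeholder)
		restore[placeholder] = e.Value
	}
	return text, restore
}

// unredact puts the original values back in place of the placeholders.
func unredact(text string, restore map[string]string) string {
	for placeholder, value := range restore {
		text = strings.ReplaceAll(text, placeholder, value)
	}
	return text
}

func main() {
	entities := []Entity{
		{Type: "PERSON", Value: "John"},
		{Type: "PERSON", Value: "Mary"},
		{Type: "ORG", Value: "Acme"},
	}
	safe, restore := redact("John met Mary at Acme.", entities)
	fmt.Println(safe)

	// Stand-in for the external MT step: placeholders pass through untouched.
	translated := strings.Replace(safe, " met ", " a rencontré ", 1)
	fmt.Println(unredact(translated, restore))
}
```

The round trip only works if the external service leaves the placeholders intact, so a real pipeline would likely verify that every placeholder is still present before restoring.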
Concept relations
Concepts are linked by persisted, typed, directed edges. A ConceptRelation
records the edge with an identity, an optional note, and an optional validity:
type ConceptRelation struct {
    ID           string          // edge identity (caller-assigned; required)
    SourceID     string          // origin concept ID
    TargetID     string          // target concept ID
    RelationType string          // a graph.Label* constant
    Note         string          // optional human note
    Validity     *graph.Validity // optional time + tag scope (nil = unbounded)
    CreatedAt    time.Time
}
RelationType draws its values from the graph.Label* constants, so relation
edges share the vocabulary used by the rest of the graph layer.
KnownRelationType and ValidateRelation reject an unknown type or a missing
ID before a write. The TermBase interface persists and queries edges:
AddRelation(ctx, rel ConceptRelation) error // upsert by ID
DeleteRelation(ctx, id string) error
RelationsOf(ctx, conceptID string, scope *graph.Scope) ([]ConceptRelation, error) // both directions
ListRelations(ctx, scope *graph.Scope) ([]ConceptRelation, error)
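To make the upsert-by-ID and both-directions semantics concrete, here is a small in-memory sketch. It is not the framework's implementation: the label set ("broader", "narrower", "related") is hypothetical because the graph.Label* constants are not listed in this note, and the context and scope parameters are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ConceptRelation is a trimmed stand-in without Note, Validity, or CreatedAt.
type ConceptRelation struct {
	ID           string
	SourceID     string
	TargetID     string
	RelationType string
}

// knownLabels is a hypothetical vocabulary standing in for graph.Label*.
var knownLabels = map[string]bool{"broader": true, "narrower": true, "related": true}

// validateRelation rejects a missing ID or an unknown relation type.
func validateRelation(r ConceptRelation) error {
	if r.ID == "" {
		return errors.New("relation ID is required")
	}
	if !knownLabels[r.RelationType] {
		return fmt.Errorf("unknown relation type %q", r.RelationType)
	}
	return nil
}

type store struct {
	byID map[string]ConceptRelation
}

func newStore() *store {
	return &store{byID: map[string]ConceptRelation{}}
}

// AddRelation validates the edge, then inserts it or overwrites the edge
// that already has the same ID (upsert).
func (s *store) AddRelation(r ConceptRelation) error {
	if err := validateRelation(r); err != nil {
		return err
	}
	s.byID[r.ID] = r
	return nil
}

// RelationsOf returns edges where the concept is either source or target,
// sorted by ID so the result is deterministic.
func (s *store) RelationsOf(conceptID string) []ConceptRelation {
	var out []ConceptRelation
	for _, r := range s.byID {
		if r.SourceID == conceptID || r.TargetID == conceptID {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	s := newStore()
	_ = s.AddRelation(ConceptRelation{ID: "r1", SourceID: "c1", TargetID: "c2", RelationType: "broader"})
	_ = s.AddRelation(ConceptRelation{ID: "r2", SourceID: "c3", TargetID: "c1", RelationType: "related"})
	fmt.Println(len(s.RelationsOf("c1")))
}
```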
Temporal and tag validity
A Term and a ConceptRelation each carry an optional *graph.Validity: a
half-open [ValidFrom, ValidTo) interval plus a map[string]string of tags.
LookupOptions.Scope and the relation read methods take a *graph.Scope (a
time plus tags); a term or edge is returned only when its validity matches the
scope (a nil validity always matches; a nil scope filters nothing). Tags are
open-ended: a caller picks its own vocabulary, such as a market key.
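The matching rule can be sketched as below. The Validity and Scope types are local stand-ins for the graph package types, and two details are assumptions made here rather than facts from this note: a zero ValidFrom or ValidTo means unbounded on that side, and a validity tag matches only when the scope carries the same key with the same value.

```go
package main

import (
	"fmt"
	"time"
)

// Validity is a stand-in for graph.Validity: a half-open interval
// [ValidFrom, ValidTo) plus tags. A zero time means "unbounded on that
// side" (assumption).
type Validity struct {
	ValidFrom time.Time
	ValidTo   time.Time
	Tags      map[string]string
}

// Scope is a stand-in for graph.Scope: a point in time plus tags.
type Scope struct {
	At   time.Time
	Tags map[string]string
}

// matches reports whether an item with validity v is visible in scope s.
// A nil validity always matches and a nil scope filters nothing.
func matches(v *Validity, s *Scope) bool {
	if v == nil || s == nil {
		return true
	}
	if !v.ValidFrom.IsZero() && s.At.Before(v.ValidFrom) {
		return false // scope time is before the interval starts
	}
	if !v.ValidTo.IsZero() && !s.At.Before(v.ValidTo) {
		return false // ValidTo itself is excluded (half-open interval)
	}
	// Assumed tag rule: every tag on the validity must be present in the
	// scope with the same value.
	for key, want := range v.Tags {
		if got, ok := s.Tags[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// day builds a UTC midnight timestamp for readability in examples.
func day(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	v := &Validity{
		ValidFrom: day(2024, 1, 1),
		ValidTo:   day(2025, 1, 1),
		Tags:      map[string]string{"market": "de"},
	}
	inScope := &Scope{At: day(2024, 6, 1), Tags: map[string]string{"market": "de"}}
	atEnd := &Scope{At: day(2025, 1, 1), Tags: map[string]string{"market": "de"}}
	fmt.Println(matches(v, inScope), matches(v, atEnd))
}
```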
Status transitions
ValidateTransition(from, to model.TermStatus) error accepts any transition
between known statuses (it rejects only unknown statuses), and
IsGovernedTransition(from, to) bool reports whether a transition is
consequential: any transition to forbidden or preferred, or from
forbidden. The framework classifies; it imposes no review workflow.
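The two rules stated above are small enough to sketch directly. The lowercase function names and the string-backed TermStatus type are local stand-ins so they are not mistaken for the framework's ValidateTransition and IsGovernedTransition; the status values are the six listed in the Term struct comment.

```go
package main

import "fmt"

// TermStatus is a stand-in for model.TermStatus (assumption: string-backed).
type TermStatus string

const (
	Proposed   TermStatus = "proposed"
	Approved   TermStatus = "approved"
	Preferred  TermStatus = "preferred"
	Admitted   TermStatus = "admitted"
	Deprecated TermStatus = "deprecated"
	Forbidden  TermStatus = "forbidden"
)

var knownStatuses = map[TermStatus]bool{
	Proposed: true, Approved: true, Preferred: true,
	Admitted: true, Deprecated: true, Forbidden: true,
}

// validateTransition accepts any move between known statuses and rejects
// only statuses it does not recognise.
func validateTransition(from, to TermStatus) error {
	if !knownStatuses[from] {
		return fmt.Errorf("unknown status %q", from)
	}
	if !knownStatuses[to] {
		return fmt.Errorf("unknown status %q", to)
	}
	return nil
}

// isGovernedTransition reports whether a transition is consequential:
// anything moving to forbidden or preferred, or moving away from forbidden.
func isGovernedTransition(from, to TermStatus) bool {
	return to == Forbidden || to == Preferred || from == Forbidden
}

func main() {
	fmt.Println(isGovernedTransition(Proposed, Approved))
	fmt.Println(isGovernedTransition(Approved, Preferred))
	fmt.Println(validateTransition(Proposed, "bogus"))
}
```

Because the framework only classifies, a caller that wants review for governed transitions would check the boolean and route the change through its own approval step.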
Content model extensions
- TermAnnotation -- matched term with concept, target terms, and position
- EntityAnnotation -- named entity with type, DNT flag, and position
These join AltTranslation as first-class annotations on Blocks (AD-002).