Terminology Data Model

This note provides implementation details for AD-010.

Data Model: Concept-Oriented

The core data model is concept-oriented, following TBX principles. A Concept groups terms across languages, each with context dimensions:

type Term struct {
    Text           string           // the term text
    Locale         model.LocaleID   // language/locale
    Status         model.TermStatus // lifecycle status (proposed, approved, preferred,
                                    // admitted, deprecated, forbidden)
    PartOfSpeech   string           // noun, verb, adjective, etc.
    Gender         string           // grammatical gender (if applicable)
    Note           string           // usage note or context
    CompetitorTerm bool             // true if this is a competitor brand term
}

type Concept struct {
    ID         string            // unique concept identifier
    ProjectID  string            // project scope (empty = workspace-scoped)
    Domain     string            // subject field (software, medical, legal, etc.)
    Definition string            // language-neutral definition
    Source     TermSource        // "terminology" or "brand_vocabulary"
    Terms      []Term            // terms across locales
    Properties map[string]string // extensible metadata
    CreatedAt  time.Time
    UpdatedAt  time.Time
}

TermSource distinguishes traditional terminology (TermSourceTerminology) from brand vocabulary (TermSourceBrandVocabulary), so the two populations can share one termbase while staying filterable.

Progressive disclosure: CSV import auto-creates Concepts with a single preferred Term per locale -- no extra complexity required.

TermBase Interface

type TermBase interface {
    AddConcept(concept Concept) error
    GetConcept(id string) (Concept, bool)
    DeleteConcept(id string) error
    Lookup(sourceText string, opts LookupOptions) []TermMatch
    LookupAll(sourceText string, opts LookupOptions) []TermMatch
    Search(query string, sourceLocale, targetLocale model.LocaleID, offset, limit int) ([]Concept, int)
    Count() int
    Concepts() []Concept
    Close() error
}

Import and export are standalone functions rather than interface methods: ImportJSON/ExportJSON, ImportCSV/ExportCSV, and ImportTBX/ExportTBX (the ISO TBX interchange format, configured through TBXImportOptions/TBXExportOptions). The framework ships two backends: in-memory (for CLI batch runs) and SQLite (persistent). Because TermBase is an interface, server-side backends for multi-user deployments can implement it as well.

Term lookup uses a tiered matching pipeline: exact -> normalized -> fuzzy. Fuzzy matching uses trigram-based candidate retrieval to avoid full table scans:

  • SQLite: FTS5 trigram tokenizer on text_lower column, synced via triggers. Falls back to length-based pre-filtering if FTS5 is unavailable.
  • PostgreSQL: pg_trgm GIN index on text_lower column, using the % similarity operator. Falls back to length-based pre-filtering.

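Character-level Levenshtein scoring (on []rune rather than bytes) is applied to roughly 200 trigram candidates. Working on runes keeps edit distances correct for all scripts, including CJK, where each character is typically a morpheme and occupies several bytes in UTF-8.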
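
The tiered pipeline and the rune-level scoring can be sketched as follows. This is a minimal illustration, not the framework's matcher: matchTerm, the tier names, and the score formula (1 minus distance over the longer length) are assumptions, and normalize uses lowercasing and trimming as a stand-in for NormalizeTerm. Candidate retrieval through trigrams is left out; the function scores a single candidate.

```go
package main

import (
	"fmt"
	"strings"
)

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, counted in
// runes so that multi-byte characters cost one edit each.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	cur := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(rb)]
}

// normalize is a stand-in for NormalizeTerm (assumption: the real
// function also applies Unicode NFC).
func normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// matchTerm tries the tiers in order: exact, normalized, fuzzy.
func matchTerm(query, term string, minScore float64) (string, float64) {
	if query == term {
		return "exact", 1
	}
	q, t := normalize(query), normalize(term)
	if q == t {
		return "normalized", 1
	}
	n := len([]rune(q))
	if m := len([]rune(t)); m > n {
		n = m
	}
	score := 1 - float64(levenshtein(q, t))/float64(n)
	if score >= minScore {
		return "fuzzy", score
	}
	return "none", score
}

func main() {
	for _, q := range []string{"Cloud", " cloud ", "clould", "server"} {
		tier, score := matchTerm(q, "Cloud", 0.8)
		fmt.Printf("%q -> %s (%.2f)\n", q, tier, score)
	}
}
```

Counting in runes matters for the CJK case: a byte-level distance between "日本語" and "日本" would be 3, while the rune-level distance is 1.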

UI search uses ranked full-text search:

  • SQLite: FTS5 trigram tokenizer for substring matching on term text.
  • PostgreSQL: pg_trgm similarity() ranking on text_lower.

Text normalization applies Unicode NFC (golang.org/x/text/unicode/norm) via NormalizeTerm() before comparison, handling Arabic diacritics, Hangul jamo composition, and accented Latin characters.

Pipeline Tools

Two pipeline tools integrate terminology into the streaming pipeline (AD-006):

term-lookup (Enrich) -- Scans source text for known terms, attaches TermAnnotation with TextRange character positions. Downstream tools (AI translate, QA) use these annotations for context.

term-enforce (Validate) -- Checks preferred term usage in target text. Reports forbidden terms, non-preferred variants, deprecated terms, and missing target counterparts.
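
To make the annotation step concrete, here is a minimal sketch of scanning source text for known terms and recording their positions. The types are simplified stand-ins for TermAnnotation and TextRange, lookupTerms is a hypothetical helper, and the choice of half-open rune offsets for "character positions" is an assumption. Matching is plain case-sensitive substring search, without the tiered pipeline.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// TextRange holds half-open rune offsets (assumption; the real type
// may count differently).
type TextRange struct{ Start, End int }

type TermAnnotation struct {
	ConceptID string
	Term      string
	Range     TextRange
}

// lookupTerms finds every occurrence of each known term in text.
// terms maps term text to concept ID.
func lookupTerms(text string, terms map[string]string) []TermAnnotation {
	var out []TermAnnotation
	for term, conceptID := range terms {
		if term == "" {
			continue
		}
		from := 0
		for {
			i := strings.Index(text[from:], term)
			if i < 0 {
				break
			}
			startByte := from + i
			// Convert the byte offset into a rune offset.
			start := utf8.RuneCountInString(text[:startByte])
			end := start + utf8.RuneCountInString(term)
			out = append(out, TermAnnotation{conceptID, term, TextRange{start, end}})
			from = startByte + len(term)
		}
	}
	// Map iteration order is random, so sort by position.
	sort.Slice(out, func(a, b int) bool { return out[a].Range.Start < out[b].Range.Start })
	return out
}

func main() {
	terms := map[string]string{"Datei": "c1", "Ordner": "c2"}
	for _, a := range lookupTerms("Öffne die Datei im Ordner", terms) {
		fmt.Printf("%s %s [%d,%d)\n", a.ConceptID, a.Term, a.Range.Start, a.Range.End)
	}
}
```

A validation pass in the style of term-enforce could reuse the same scan on target text with a list of forbidden or deprecated terms and report each hit by its range.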

Related AI and redaction tools (registered in core/ai/tools/ and core/tools/):

ai-terminology (Enrich, AI) -- LLM extraction of candidate terms. Uses an AI provider from AD-011.

ai-entity-extract (Enrich, AI) -- Named entity annotation (people, organizations, products, dates, locations). Serves multiple purposes: TM generalization in Sievepen (AD-009), do-not-translate markers, localization hints, and terminology candidate discovery. Should run early in the pipeline -- before tm-leverage.

redact (Transform) -- Privacy tool that replaces entity values with typed placeholders (e.g., "John" -> {PERSON}) before text is sent to external services. See AD-020.

unredact (Transform) -- Restores original entity values after external processing. Paired with redact:

  reader -> ai-entity-extract -> redact -> [external MT] -> unredact -> writer
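
The redact/unredact pairing can be sketched as a round trip. This is an illustration only: the Entity type, byte-offset ranges, and numbered placeholders such as {PERSON_1} are assumptions (numbering is used here so that two entities of the same type restore to the right values), and the real tools may track placeholders differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity marks a span to redact. Offsets are byte offsets, half-open;
// entities are assumed sorted and non-overlapping.
type Entity struct {
	Type       string // e.g. "PERSON"
	Start, End int
}

// redact replaces each entity with a typed, numbered placeholder and
// returns the mapping needed to restore the original values.
func redact(text string, ents []Entity) (string, map[string]string) {
	var b strings.Builder
	restore := map[string]string{}
	counts := map[string]int{}
	pos := 0
	for _, e := range ents {
		counts[e.Type]++
		ph := fmt.Sprintf("{%s_%d}", e.Type, counts[e.Type])
		b.WriteString(text[pos:e.Start])
		b.WriteString(ph)
		restore[ph] = text[e.Start:e.End]
		pos = e.End
	}
	b.WriteString(text[pos:])
	return b.String(), restore
}

// unredact puts the original values back in place of placeholders.
func unredact(text string, restore map[string]string) string {
	for ph, orig := range restore {
		text = strings.ReplaceAll(text, ph, orig)
	}
	return text
}

func main() {
	src := "John met Mary in Paris."
	ents := []Entity{{"PERSON", 0, 4}, {"PERSON", 9, 13}, {"LOCATION", 17, 22}}
	safe, restore := redact(src, ents)
	fmt.Println(safe)
	// An external service would process `safe` here.
	fmt.Println(unredact(safe, restore))
}
```

The sketch relies on the external service leaving placeholders intact; a production pipeline would need to handle placeholders that come back altered or missing.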

Concept relations

Concepts are linked by persisted, typed, directed edges. A ConceptRelation records the edge with an identity, an optional note, and an optional validity:

type ConceptRelation struct {
    ID           string          // edge identity (caller-assigned; required)
    SourceID     string          // origin concept ID
    TargetID     string          // target concept ID
    RelationType string          // a graph.Label* constant
    Note         string          // optional human note
    Validity     *graph.Validity // optional time + tag scope (nil = unbounded)
    CreatedAt    time.Time
}

RelationType draws its values from the graph.Label* constants, so relation edges share the vocabulary used by the rest of the graph layer. KnownRelationType and ValidateRelation reject an unknown type or a missing ID before a write. The TermBase interface persists and queries edges:

AddRelation(ctx, rel ConceptRelation) error // upsert by ID
DeleteRelation(ctx, id string) error
RelationsOf(ctx, conceptID string, scope *graph.Scope) ([]ConceptRelation, error) // both directions
ListRelations(ctx, scope *graph.Scope) ([]ConceptRelation, error)
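
The two behaviors called out in the comments, upsert by ID and lookup in both directions, can be sketched with an in-memory store. This is a simplified illustration: relationStore is hypothetical, the context and scope parameters are omitted, and the relation type is a placeholder string rather than a real graph.Label* constant.

```go
package main

import (
	"errors"
	"fmt"
)

// ConceptRelation is a reduced version of the struct shown above.
type ConceptRelation struct {
	ID, SourceID, TargetID, RelationType string
}

// relationStore is a hypothetical in-memory edge store.
type relationStore struct {
	byID  map[string]ConceptRelation
	order []string // insertion order of IDs, for stable listing
}

func newRelationStore() *relationStore {
	return &relationStore{byID: map[string]ConceptRelation{}}
}

// AddRelation inserts the edge, or replaces an existing edge with the
// same ID (upsert). A missing ID is rejected.
func (s *relationStore) AddRelation(rel ConceptRelation) error {
	if rel.ID == "" {
		return errors.New("relation ID is required")
	}
	if _, exists := s.byID[rel.ID]; !exists {
		s.order = append(s.order, rel.ID)
	}
	s.byID[rel.ID] = rel
	return nil
}

// RelationsOf returns edges where the concept is source or target.
func (s *relationStore) RelationsOf(conceptID string) []ConceptRelation {
	var out []ConceptRelation
	for _, id := range s.order {
		rel := s.byID[id]
		if rel.SourceID == conceptID || rel.TargetID == conceptID {
			out = append(out, rel)
		}
	}
	return out
}

func main() {
	s := newRelationStore()
	// "example-label" is a placeholder, not a real graph.Label* value.
	_ = s.AddRelation(ConceptRelation{ID: "r1", SourceID: "a", TargetID: "b", RelationType: "example-label"})
	_ = s.AddRelation(ConceptRelation{ID: "r2", SourceID: "c", TargetID: "a", RelationType: "example-label"})
	fmt.Println(len(s.RelationsOf("a")))
}
```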

Temporal and tag validity

A Term and a ConceptRelation each carry an optional *graph.Validity: a half-open [ValidFrom, ValidTo) interval plus a map[string]string of tags. LookupOptions.Scope and the relation read methods take a *graph.Scope (a time plus tags); a term or edge is returned only when its validity matches the scope (a nil validity always matches; a nil scope filters nothing). Tags are open-ended — a caller picks a vocabulary, such as a market key.

Status transitions

ValidateTransition(from, to model.TermStatus) error accepts any transition between known statuses (it rejects only unknown statuses), and IsGovernedTransition(from, to) bool reports whether a transition is consequential: any transition to forbidden or preferred, or from forbidden. The framework classifies; it imposes no review workflow.
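
The two rules translate directly into code. The sketch below re-implements them on a local TermStatus string type with lowercase function names to mark it as an illustration; the constant names and status string values are assumptions, and the real functions operate on model.TermStatus.

```go
package main

import "fmt"

// TermStatus is a stand-in for model.TermStatus.
type TermStatus string

const (
	Proposed   TermStatus = "proposed"
	Approved   TermStatus = "approved"
	Preferred  TermStatus = "preferred"
	Admitted   TermStatus = "admitted"
	Deprecated TermStatus = "deprecated"
	Forbidden  TermStatus = "forbidden"
)

var knownStatuses = map[TermStatus]bool{
	Proposed: true, Approved: true, Preferred: true,
	Admitted: true, Deprecated: true, Forbidden: true,
}

// validateTransition accepts any move between known statuses and
// rejects only unknown ones.
func validateTransition(from, to TermStatus) error {
	if !knownStatuses[from] {
		return fmt.Errorf("unknown status %q", from)
	}
	if !knownStatuses[to] {
		return fmt.Errorf("unknown status %q", to)
	}
	return nil
}

// isGovernedTransition reports whether a transition is consequential:
// any move to forbidden or preferred, or any move out of forbidden.
func isGovernedTransition(from, to TermStatus) bool {
	return to == Forbidden || to == Preferred || from == Forbidden
}

func main() {
	fmt.Println(isGovernedTransition(Proposed, Approved))
	fmt.Println(isGovernedTransition(Approved, Preferred))
	fmt.Println(validateTransition(Approved, "retired"))
}
```

A caller that wants a review workflow can branch on the governed flag, for example by requiring sign-off before persisting the change, while ungoverned transitions go straight through.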

Content model extensions

  • TermAnnotation -- matched term with concept, target terms, and position
  • EntityAnnotation -- named entity with type, DNT flag, and position

These join AltTranslation as first-class annotations on Blocks (AD-002).