Terminology Data Model

This note provides implementation details for AD-010.

Data Model: Concept-Oriented

The core data model is concept-oriented, following TBX principles. A Concept groups terms across languages, each with context dimensions:

type Term struct {
    Text           string           // the term text
    Locale         model.LocaleID   // language/locale
    Status         model.TermStatus // lifecycle status (proposed, approved, preferred,
                                    // admitted, deprecated, forbidden)
    PartOfSpeech   string           // noun, verb, adjective, etc.
    Gender         string           // grammatical gender (if applicable)
    Note           string           // usage note or context
    CompetitorTerm bool             // true if this is a competitor brand term
}

type Concept struct {
    ID         string            // unique concept identifier
    ProjectID  string            // project scope (empty = workspace-scoped)
    Domain     string            // subject field (software, medical, legal, etc.)
    Definition string            // language-neutral definition
    Source     TermSource        // "terminology" or "brand_vocabulary"
    Terms      []Term            // terms across locales
    Properties map[string]string // extensible metadata
    CreatedAt  time.Time
    UpdatedAt  time.Time
}

TermSource distinguishes traditional terminology (TermSourceTerminology) from brand vocabulary (TermSourceBrandVocabulary), so the two populations can share one termbase while staying filterable.

Progressive disclosure: CSV import auto-creates Concepts with a single preferred Term per locale -- no extra complexity required.
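As a rough illustration of that import path, the sketch below turns a CSV into one concept per row with a single preferred term per locale. The column layout (a header row of locale codes, one concept per data row), the trimmed-down Term and Concept types, the generated IDs, and the conceptsFromCSV name are all assumptions made for the example; the real ImportCSV may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Simplified stand-ins for the Term and Concept structs above.
type Term struct {
	Text, Locale, Status string
}

type Concept struct {
	ID    string
	Terms []Term
}

// conceptsFromCSV is a hypothetical helper: the first row lists locale codes,
// and every later row becomes one Concept with one preferred Term per
// non-empty cell.
func conceptsFromCSV(data string) ([]Concept, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, nil
	}
	locales := rows[0]
	var concepts []Concept
	for i, row := range rows[1:] {
		c := Concept{ID: fmt.Sprintf("csv-%d", i+1)} // assumed ID scheme
		for col, text := range row {
			text = strings.TrimSpace(text)
			if text == "" {
				continue // no term for this locale
			}
			c.Terms = append(c.Terms, Term{Text: text, Locale: locales[col], Status: "preferred"})
		}
		if len(c.Terms) > 0 {
			concepts = append(concepts, c)
		}
	}
	return concepts, nil
}

func main() {
	concepts, err := conceptsFromCSV("en,de,fr\nfile,Datei,fichier\nfolder,Ordner,\n")
	if err != nil {
		panic(err)
	}
	for _, c := range concepts {
		fmt.Println(c.ID, c.Terms)
	}
}
```

The user never sees the concept layer here: each row simply becomes a concept, and richer fields (definition, domain, additional statuses) can be added later.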

TermBase Interface

type TermBase interface {
    AddConcept(concept Concept) error
    GetConcept(id string) (Concept, bool)
    DeleteConcept(id string) error
    Lookup(sourceText string, opts LookupOptions) []TermMatch
    LookupAll(sourceText string, opts LookupOptions) []TermMatch
    Search(query string, sourceLocale, targetLocale model.LocaleID, offset, limit int) ([]Concept, int)
    Count() int
    Concepts() []Concept
    Close() error
}

Import and export are standalone functions rather than interface methods: ImportJSON/ExportJSON, ImportCSV/ExportCSV, and ImportTBX/ExportTBX (the ISO TBX interchange format, with TBXImportOptions/TBXExportOptions).

The framework ships two backends: in-memory (for CLI batch runs) and SQLite (persistent). The TermBase interface also accommodates server-side backends for multi-user deployments.

Fuzzy Matching and Search

Term lookup uses a tiered matching pipeline: exact -> normalized -> fuzzy. Fuzzy matching uses trigram-based candidate retrieval to avoid full table scans:

SQLite: FTS5 trigram tokenizer on text_lower column, synced via triggers. Falls back to length-based pre-filtering if FTS5 is unavailable.
PostgreSQL: pg_trgm GIN index on text_lower column, using the % similarity operator. Falls back to length-based pre-filtering.
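The idea behind trigram candidate retrieval can be sketched in memory. This is a simplified stand-in for what the FTS5 and pg_trgm indexes do, not their actual algorithm: the trigrams and candidates functions and the minShared threshold are hypothetical, and a real index avoids the linear scan over terms shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams returns the set of 3-rune windows of the lowercased input.
func trigrams(s string) map[string]bool {
	r := []rune(strings.ToLower(s))
	set := make(map[string]bool)
	for i := 0; i+3 <= len(r); i++ {
		set[string(r[i:i+3])] = true
	}
	return set
}

// candidates keeps the terms that share at least minShared trigrams with the
// query. Only these survivors need the more expensive edit-distance scoring.
func candidates(query string, terms []string, minShared int) []string {
	q := trigrams(query)
	var out []string
	for _, t := range terms {
		shared := 0
		for g := range trigrams(t) {
			if q[g] {
				shared++
			}
		}
		if shared >= minShared {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	terms := []string{"translation memory", "term base", "memory leak"}
	// The query contains a typo ("translaton").
	fmt.Println(candidates("translaton memory", terms, 5))
}
```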

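Character-level Levenshtein scoring (on []rune) is then applied to the roughly 200 candidates returned by the trigram stage. Comparing runes rather than bytes keeps edit distances meaningful for every script, including CJK, where a single character typically carries a whole morpheme.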
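A minimal sketch of rune-level Levenshtein distance follows. The function names and the similarity formula (1 minus distance over the longer length) are assumptions for illustration; the framework's actual scoring function is not specified in this note.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in Unicode
// code points (runes) rather than bytes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarity maps the distance to a 0..1 score (assumed formula).
func similarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	// One inserted character; a byte-based distance would count 3.
	fmt.Println(levenshtein("翻訳メモリ", "翻訳メモリー")) // 1
	fmt.Println(similarity("Datei", "Dateien"))
}
```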

UI search uses ranked full-text search:

SQLite: FTS5 trigram tokenizer for substring matching on term text.
PostgreSQL: pg_trgm similarity() ranking on text_lower.

Text normalization applies Unicode NFC (golang.org/x/text/unicode/norm) via NormalizeTerm() before comparison, handling Arabic diacritics, Hangul jamo composition, and accented Latin characters.

Pipeline Tools

Two pipeline tools integrate terminology into the streaming pipeline (AD-006):

term-lookup (Enrich) -- Scans source text for known terms, attaches TermAnnotation with TextRange character positions. Downstream tools (AI translate, QA) use these annotations for context.

term-enforce (Validate) -- Checks preferred term usage in target text. Reports forbidden terms, non-preferred variants, deprecated terms, and missing target counterparts.
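To make the character-position idea concrete, here is a rough sketch of a term-lookup style scan that reports rune offsets rather than byte offsets. The field layout of TextRange and TermAnnotation (rune offsets, exclusive end) and the lookupTerms function are assumptions; the sketch also matches case-sensitively and skips the normalized and fuzzy tiers.

```go
package main

import "fmt"

// TextRange holds rune offsets into the source text; End is exclusive.
// The real TextRange layout is not shown in this note, so this is assumed.
type TextRange struct {
	Start, End int
}

// TermAnnotation is a trimmed-down stand-in for the real annotation type.
type TermAnnotation struct {
	Term  string
	Range TextRange
}

// lookupTerms scans text for every occurrence of each known term and records
// where it was found, measured in runes.
func lookupTerms(text string, terms []string) []TermAnnotation {
	src := []rune(text)
	var out []TermAnnotation
	for _, term := range terms {
		n := len([]rune(term))
		if n == 0 {
			continue
		}
		for i := 0; i+n <= len(src); i++ {
			if string(src[i:i+n]) == term {
				out = append(out, TermAnnotation{Term: term, Range: TextRange{Start: i, End: i + n}})
			}
		}
	}
	return out
}

func main() {
	// "Ö" is two bytes in UTF-8, so the byte offset of "Datei" is 16,
	// while its rune offset is 15.
	for _, a := range lookupTerms("Öffnen Sie die Datei", []string{"Datei"}) {
		fmt.Printf("%s [%d,%d)\n", a.Term, a.Range.Start, a.Range.End)
	}
}
```

A term-enforce style check could run the same scan over the target text with the forbidden and deprecated terms of the matched concepts.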

Related AI and redaction tools (registered in core/ai/tools/ and core/tools/):

ai-terminology (Enrich, AI) -- LLM extraction of candidate terms. Uses an AI provider from AD-011.

ai-entity-extract (Enrich, AI) -- Named entity annotation (people, organizations, products, dates, locations). Serves multiple purposes: TM generalization in Sievepen (AD-009), do-not-translate markers, localization hints, and terminology candidate discovery. Should run early in the pipeline -- before tm-leverage.

redact (Transform) -- Privacy tool that replaces entity values with typed placeholders (e.g., "John" -> {PERSON}) before text is sent to external services. See AD-020.

unredact (Transform) -- Restores original entity values after external processing. Paired with redact in a pipeline such as:

reader -> ai-entity-extract -> redact -> [external MT] -> unredact -> writer
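The redact/unredact round trip can be sketched as follows. This is only an illustration under stated assumptions: the Entity type, the function signatures, and in particular the numbered placeholder format ({PERSON_1}, {PERSON_2}) are hypothetical, added here so that several entities of the same type can be restored unambiguously.

```go
package main

import (
	"fmt"
	"strings"
)

// Entity is a simplified stand-in for an extracted named entity.
type Entity struct {
	Type, Value string
}

// redact replaces each entity value with a typed, numbered placeholder and
// returns the mapping needed to undo the replacement later.
func redact(text string, entities []Entity) (string, map[string]string) {
	restore := make(map[string]string)
	counts := make(map[string]int)
	for _, e := range entities {
		if e.Value == "" || !strings.Contains(text, e.Value) {
			continue
		}
		counts[e.Type]++
		placeholder := fmt.Sprintf("{%s_%d}", e.Type, counts[e.Type])
		text = strings.ReplaceAll(text, e.Value, placeholder)
		restore[placeholder] = e.Value
	}
	return text, restore
}

// unredact puts the original values back in place of the placeholders.
func unredact(text string, restore map[string]string) string {
	for placeholder, value := range restore {
		text = strings.ReplaceAll(text, placeholder, value)
	}
	return text
}

func main() {
	entities := []Entity{{"PERSON", "John"}, {"PERSON", "Mary"}, {"ORG", "Acme"}}
	redacted, restore := redact("John met Mary at Acme.", entities)
	fmt.Println(redacted)

	// Stand-in for the external MT step, which should leave placeholders intact.
	translated := "{PERSON_1} a rencontré {PERSON_2} chez {ORG_1}."
	fmt.Println(unredact(translated, restore))
}
```

The external service only ever sees the placeholder form, so the round trip depends on that service passing placeholders through unchanged.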

Concept relations

Concepts are linked by persisted, typed, directed edges. A ConceptRelation records the edge with an identity, an optional note, and an optional validity:

type ConceptRelation struct {
    ID           string          // edge identity (caller-assigned; required)
    SourceID     string          // origin concept ID
    TargetID     string          // target concept ID
    RelationType string          // a graph.Label* constant
    Note         string          // optional human note
    Validity     *graph.Validity // optional time + tag scope (nil = unbounded)
    CreatedAt    time.Time
}

RelationType draws its values from the graph.Label* constants, so relation edges share the vocabulary used by the rest of the graph layer. KnownRelationType and ValidateRelation reject an unknown type or a missing ID before a write. The TermBase interface persists and queries edges:

AddRelation(ctx, rel ConceptRelation) error            // upsert by ID
DeleteRelation(ctx, id string) error
RelationsOf(ctx, conceptID string, scope *graph.Scope) ([]ConceptRelation, error) // both directions
ListRelations(ctx, scope *graph.Scope) ([]ConceptRelation, error)
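The semantics above (validate before write, upsert by ID, query in both directions) can be sketched with a small in-memory store. The label constants, the memRelations type, and the lower-case function names are hypothetical; the real graph.Label* values are not listed in this note, and context and scope handling are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Hypothetical labels standing in for the graph.Label* constants.
const (
	LabelBroader = "broader"
	LabelRelated = "related"
)

var knownTypes = map[string]bool{LabelBroader: true, LabelRelated: true}

// ConceptRelation is a trimmed-down version of the struct above.
type ConceptRelation struct {
	ID, SourceID, TargetID, RelationType, Note string
}

// validateRelation rejects a missing ID or an unknown relation type.
func validateRelation(r ConceptRelation) error {
	if r.ID == "" {
		return errors.New("relation ID is required")
	}
	if !knownTypes[r.RelationType] {
		return fmt.Errorf("unknown relation type %q", r.RelationType)
	}
	return nil
}

type memRelations struct {
	byID map[string]ConceptRelation
}

func newMemRelations() *memRelations {
	return &memRelations{byID: make(map[string]ConceptRelation)}
}

// AddRelation validates the edge, then inserts or overwrites it by ID.
func (m *memRelations) AddRelation(r ConceptRelation) error {
	if err := validateRelation(r); err != nil {
		return err
	}
	m.byID[r.ID] = r
	return nil
}

// RelationsOf returns edges where the concept is either source or target,
// sorted by ID so the result is deterministic.
func (m *memRelations) RelationsOf(conceptID string) []ConceptRelation {
	var out []ConceptRelation
	for _, r := range m.byID {
		if r.SourceID == conceptID || r.TargetID == conceptID {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	store := newMemRelations()
	_ = store.AddRelation(ConceptRelation{ID: "r1", SourceID: "c1", TargetID: "c2", RelationType: LabelBroader})
	_ = store.AddRelation(ConceptRelation{ID: "r2", SourceID: "c3", TargetID: "c1", RelationType: LabelRelated})
	fmt.Println(store.RelationsOf("c1"))
	fmt.Println(store.AddRelation(ConceptRelation{SourceID: "c1", TargetID: "c2", RelationType: LabelBroader}))
}
```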

Temporal and tag validity

A Term and a ConceptRelation each carry an optional *graph.Validity: a half-open [ValidFrom, ValidTo) interval plus a map[string]string of tags. LookupOptions.Scope and the relation read methods take a *graph.Scope (a time plus tags). A term or edge is returned only when its validity matches the scope; a nil validity always matches, and a nil scope filters nothing. Tags are open-ended: the caller picks the vocabulary, such as a market key.
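The matching rule might look like the sketch below. The Validity and Scope types are simplified stand-ins for the graph package types, and two details are assumptions not stated in this note: a zero ValidFrom or ValidTo is treated as unbounded on that side, and a tag conflicts only when the scope sets the same key to a different value.

```go
package main

import (
	"fmt"
	"time"
)

// Validity and Scope are simplified stand-ins for graph.Validity and graph.Scope.
type Validity struct {
	ValidFrom, ValidTo time.Time // zero value = unbounded on that side (assumed)
	Tags               map[string]string
}

type Scope struct {
	At   time.Time
	Tags map[string]string
}

// matches reports whether an item with validity v is visible in scope s.
func matches(v *Validity, s *Scope) bool {
	if v == nil || s == nil {
		return true // nil validity always matches; nil scope filters nothing
	}
	if !v.ValidFrom.IsZero() && s.At.Before(v.ValidFrom) {
		return false
	}
	// Half-open interval: the instant ValidTo itself is already outside.
	if !v.ValidTo.IsZero() && !s.At.Before(v.ValidTo) {
		return false
	}
	// Assumed tag rule: reject only on a conflicting value for a shared key.
	for key, want := range v.Tags {
		if got, ok := s.Tags[key]; ok && got != want {
			return false
		}
	}
	return true
}

// day is a small helper for building UTC dates in the example.
func day(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
}

func main() {
	v := &Validity{ValidFrom: day(2024, 1, 1), ValidTo: day(2025, 1, 1), Tags: map[string]string{"market": "de"}}
	fmt.Println(matches(v, &Scope{At: day(2024, 6, 1), Tags: map[string]string{"market": "de"}})) // true
	fmt.Println(matches(v, &Scope{At: day(2025, 1, 1), Tags: map[string]string{"market": "de"}})) // false
	fmt.Println(matches(v, &Scope{At: day(2024, 6, 1), Tags: map[string]string{"market": "fr"}})) // false
}
```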

Status transitions

ValidateTransition(from, to model.TermStatus) error accepts any transition between known statuses (it rejects only unknown statuses), and IsGovernedTransition(from, to) bool reports whether a transition is consequential: any transition to forbidden or preferred, or from forbidden. The framework classifies; it imposes no review workflow.
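The two rules can be written down in a few lines. The sketch below uses a local TermStatus type with hypothetical constant names (the real ones live in the model package and are not shown here); the status values themselves come from the Term struct comment above.

```go
package main

import "fmt"

// TermStatus and its constants are local stand-ins for model.TermStatus.
type TermStatus string

const (
	StatusProposed   TermStatus = "proposed"
	StatusApproved   TermStatus = "approved"
	StatusPreferred  TermStatus = "preferred"
	StatusAdmitted   TermStatus = "admitted"
	StatusDeprecated TermStatus = "deprecated"
	StatusForbidden  TermStatus = "forbidden"
)

var knownStatuses = map[TermStatus]bool{
	StatusProposed: true, StatusApproved: true, StatusPreferred: true,
	StatusAdmitted: true, StatusDeprecated: true, StatusForbidden: true,
}

// validateTransition accepts any move between known statuses and rejects
// only statuses it does not recognize.
func validateTransition(from, to TermStatus) error {
	for _, s := range []TermStatus{from, to} {
		if !knownStatuses[s] {
			return fmt.Errorf("unknown term status %q", s)
		}
	}
	return nil
}

// isGovernedTransition flags the consequential moves: anything to forbidden
// or preferred, and anything from forbidden.
func isGovernedTransition(from, to TermStatus) bool {
	return to == StatusForbidden || to == StatusPreferred || from == StatusForbidden
}

func main() {
	fmt.Println(validateTransition(StatusProposed, StatusApproved))   // <nil>
	fmt.Println(isGovernedTransition(StatusProposed, StatusApproved)) // false
	fmt.Println(isGovernedTransition(StatusApproved, StatusPreferred)) // true
	fmt.Println(validateTransition(StatusProposed, "archived"))
}
```

A caller that wants a review step can gate on isGovernedTransition itself, which matches the note that the framework only classifies transitions.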

Content model extensions

TermAnnotation -- matched term with concept, target terms, and position
EntityAnnotation -- named entity with type, DNT flag, and position

These join AltTranslation as first-class annotations on Blocks (AD-002).
