Authoring Vocabularies

This guide covers implementing and extending vocabularies — the semantic type system that classifies inline codes. For what vocabularies are and why they exist, see the concept page: Vocabularies.

Vocabulary file format

Each vocabulary is a JSON file. Types are keyed by a category:name identifier and carry rendering, display, color, and constraint metadata:

{
  "name": "my-vocabulary",
  "version": "1.0",
  "extends": "common-formatting",
  "entity_prefix": "entity:",
  "types": {
    "category:type-name": {
      "category": "category-name",
      "label": "Human Readable Label",
      "html": {
        "open": "<tag>",
        "close": "</tag>",
        "placeholder": "<tag/>"
      },
      "display": {
        "open": "[TAG]",
        "close": "[/TAG]",
        "placeholder": "[TAG/]"
      },
      "chipLabel": {
        "open": "tag>",
        "close": "/tag",
        "placeholder": "tag"
      },
      "color": {
        "bg": "rgba(59,130,246,0.15)",
        "border": "rgba(59,130,246,0.5)",
        "text": "rgb(59,130,246)"
      },
      "equiv": "",
      "constraints": {
        "deletable": true,
        "cloneable": true,
        "reorderable": true
      }
    }
  },
  "fallback": {
    "html": { "open": "<span>", "close": "</span>", "placeholder": "<span/>" },
    "display": { "open": "[?]", "close": "[/?]", "placeholder": "[?/]" },
    "chipLabel": { "open": "?>", "close": "/?", "placeholder": "?" },
    "color": {
      "bg": "rgba(156,163,175,0.15)",
      "border": "rgba(156,163,175,0.5)",
      "text": "rgb(107,114,128)"
    },
    "constraints": { "deletable": true, "cloneable": true, "reorderable": true }
  }
}

Field reference

Field	Required	Description
`name`	Yes	Unique vocabulary name
`version`	Yes	Semver version string
`extends`	No	Parent vocabulary name (types are merged)
`entity_prefix`	No	Prefix for entity-type inline codes (default `"entity:"`)
`types`	Yes	Map of type name → `SpanTypeInfo`
`fallback`	No	Default rendering for unknown types

Type name convention

Type names follow the category:name pattern: fmt:bold, link:hyperlink, code:variable, struct:break.

Constraint semantics

Constraint	`true`	`false`
`deletable`	Translator may remove the tag	Tag must appear in translation (enforced)
`cloneable`	Translator may duplicate the tag	Tag count must not exceed source count
`reorderable`	Translator may rearrange tag position	Tag position relative to others is locked

Using vocabularies in a format reader

A format reader initializes a VocabularyRegistry and uses it to populate inline-code metadata as it builds a Block's []model.Run sequence:

package myformat

import "github.com/neokapi/neokapi/core/model"

type Reader struct {
    vocab *model.VocabularyRegistry
}

func NewReader() *Reader {
    vocab := model.NewVocabularyRegistry()
    _ = vocab.LoadDefaults() // common-formatting + rich-html + rich-jsx + code-tokens
    return &Reader{vocab: vocab}
}

Inline content is a flat []model.Run (see AD-002: Content Model). An opening tag becomes a PcOpenRun, its matching close a PcCloseRun with the same ID, and a self-closing construct a PlaceholderRun. When building one, look up the vocabulary entry and populate the rendering and constraint fields — mirroring the per-format runBuilder helpers (core/formats/*/run_builder.go):

// openRun builds the opening half of a paired code, e.g. <b> / <a href="…">.
func (r *Reader) openRun(semType, subType, id, nativeMarkup string) model.Run {
    info := r.vocab.LookupOrFallback(semType)
    return model.Run{PcOpen: &model.PcOpenRun{
        ID:      id,            // shared with the matching PcClose
        Type:    semType,       // "fmt:bold"
        SubType: subType,       // "html:b" or "md:strong"
        Data:    nativeMarkup,  // original markup for roundtrip
        Disp:    info.Display.Open,    // "[B]"
        Equiv:   info.Equiv,           // "" (or "\n" for struct:break)
        Constraints: &model.RunConstraints{
            Deletable:   info.Constraints.Deletable,
            Cloneable:   info.Constraints.Cloneable,
            Reorderable: info.Constraints.Reorderable,
        },
    }}
}

// closeRun builds the matching close. PcCloseRun shares the opener's ID and
// replays its own native markup; it inherits the opener's constraints.
func (r *Reader) closeRun(semType, subType, id, nativeMarkup string) model.Run {
    info := r.vocab.LookupOrFallback(semType)
    return model.Run{PcClose: &model.PcCloseRun{
        ID:      id,
        Type:    semType,
        SubType: subType,
        Data:    nativeMarkup,  // "</b>"
        Equiv:   info.Equiv,
    }}
}

// phRun builds a self-closing placeholder, e.g. <br/> or a variable token.
func (r *Reader) phRun(semType, subType, id, nativeMarkup string) model.Run {
    info := r.vocab.LookupOrFallback(semType)
    return model.Run{Ph: &model.PlaceholderRun{
        ID:      id,
        Type:    semType,
        SubType: subType,
        Data:    nativeMarkup,
        Disp:    info.Display.Placeholder, // "[BR/]"
        Equiv:   info.Equiv,               // "\n" for struct:break
        Constraints: &model.RunConstraints{
            Deletable:   info.Constraints.Deletable,
            Cloneable:   info.Constraints.Cloneable,
            Reorderable: info.Constraints.Reorderable,
        },
    }}
}

Mapping native elements to semantic types

Each format maps its native constructs to semantic types. The HTML reader keys a name → type map on the element name:

var htmlSemanticTypes = map[string]string{
    "b": "fmt:bold", "strong": "fmt:bold",
    "i": "fmt:italic", "em": "fmt:italic",
    "u": "fmt:underline", "s": "fmt:strikethrough",
    "a": "link:hyperlink", "code": "fmt:code",
    "br": "struct:break", "img": "media:image",
    "sub": "fmt:subscript", "sup": "fmt:superscript", "mark": "fmt:highlight",
}

The Markdown reader has no such map. It switches on goldmark AST node types and assigns the semantic type per node before calling r.vocab.LookupOrFallback(…), resolving to the same vocabulary types:

Markdown construct	Semantic type
strong emphasis (`ast.Emphasis` level 2)	`fmt:bold`
emphasis (`ast.Emphasis` level 1)	`fmt:italic`
inline code	`fmt:code`
link	`link:hyperlink`
image	`link:image`

A soft line break is not a run: it is emitted as inline text continuation (see softBreakContinuation), not a struct:break placeholder.

SubType conventions

The SubType field records format-specific provenance using a prefix convention: html: (html:b, html:span), md: (md:strong), xlf: (xlf:var), docx: (docx:w:b). Custom formats should use a reverse-domain prefix: com.acme:custom-tag.

Creating a custom vocabulary

1. Create the JSON file

Create a JSON file under core/model/vocabularies/:

{
  "name": "my-domain",
  "version": "1.0",
  "extends": "common-formatting",
  "types": {
    "domain:widget": {
      "category": "domain",
      "label": "Widget",
      "html": { "placeholder": "<span class=\"widget\"/>" },
      "display": { "placeholder": "[WIDGET]" },
      "chipLabel": { "placeholder": "wgt" },
      "color": {
        "bg": "rgba(168,85,247,0.15)",
        "border": "rgba(168,85,247,0.5)",
        "text": "rgb(168,85,247)"
      },
      "equiv": "",
      "constraints": { "deletable": false, "cloneable": false, "reorderable": true }
    }
  }
}

2. Load it into the registry

LoadDefaults() loads the embedded vocabularies. To add one at runtime:

vocab := model.NewVocabularyRegistry()
vocab.LoadDefaults()

customData, _ := os.ReadFile("my-domain.json")
vocab.Load(customData)

3. Map it in your reader

Add the new type to your format reader's semantic type mapping:

var myFormatSemanticTypes = map[string]string{
    "widget": "domain:widget",
}

SpanClassify tool

For formats that do not perform full semantic classification (for example, when content arrives via the Okapi bridge), the span-classify tool reclassifies generic code:markup inline-code runs (Ph / PcOpen / PcClose) into proper semantic types:

tool := tools.NewSpanClassifyTool(&tools.SpanClassifyConfig{})

It applies strategies in order: check the run's SubType against known Okapi type strings, parse Data for an HTML element name, look that name up in the semantic type map, and otherwise leave the run as code:markup. The tool name is retained for backwards compatibility with existing flow definitions.

Testing vocabularies

func TestMyVocabulary(t *testing.T) {
    vocab := model.NewVocabularyRegistry()
    require.NoError(t, vocab.LoadDefaults())

    info := vocab.Lookup("fmt:bold")
    require.NotNil(t, info)
    assert.Equal(t, "formatting", info.Category)
    assert.True(t, info.Constraints.Deletable)

    unknown := vocab.LookupOrFallback("custom:unknown")
    require.NotNil(t, unknown)
    assert.True(t, unknown.Constraints.Deletable) // fallback rendering
}

Best practices

Use existing types when possible. Map to fmt:bold rather than creating my-format:bold.
Set constraints conservatively. Mark code tokens non-deletable; formatting fully flexible.
Keep vocabularies small. Only add types with distinct rendering or constraint needs.
Test roundtrip fidelity. Vocabulary types affect rendering, but each run's Data drives output — verify both.
Extend rather than replace. Use extends to build on common-formatting.

Vocabularies — the concept and built-in vocabularies.
Implementing a Format — building readers and writers.
Inline Formatting — the inline-code model in the content model.

Vocabulary file format​

Field reference​

Type name convention​

Constraint semantics​

Using vocabularies in a format reader​

Mapping native elements to semantic types​

SubType conventions​

Creating a custom vocabulary​

1. Create the JSON file​

2. Load it into the registry​

3. Map it in your reader​

SpanClassify tool​

Testing vocabularies​

Best practices​

Related reading​