ML model benchmark

kapi's subjective checks (voice/style similarity, register, do-not-translate by entity) are served by small, open, multilingual models run in-process through the same ONNX runtime the segmenter uses. The cost a user pays is not the per-sentence inference — that is cheap — but the model download and the resident memory. This page measures both, so the choice between running a check on your machine and running it server-side is grounded in numbers.

Measured cost per model

Model / variant	What it checks	Download	Load	Inference	Peak memory	License
`e5-small`	Voice / style similarity (sentence embeddings)	464.8 MB	248.1 ms	4.7 ms	1183.3 MB	MIT
`e5-small-O4`	Voice / style similarity — fp16-optimized (O4) variant	240.5 MB	183.2 ms	4.68 ms	520 MB	MIT
`e5-small-int8`	Voice / style similarity — int8-quantized variant	129.2 MB	83.1 ms	2.66 ms	39.5 MB	MIT
`gliner-multi`	Do-not-translate / entity spotting (zero-shot NER)	1119.1 MB	—	—	—	Apache-2.0
`gliner-multi-int8`	GLiNER — int8-quantized variant (download/footprint mitigation)	348.5 MB	—	—	—	Apache-2.0
`formality` (no ONNX yet)	Register / formality classification	2.8 MB	—	—	—	academic (s-nlp)
`sat-3l-sm`	Reference: the segmenter kapi already ships (kapi-sat)	408.5 MB	—	—	—	MIT

Darwin arm64 · Python 3.11.6 · onnxruntime 1.26.0, running on CPU. Inference is the mean over repeated runs on short multilingual sentences. The Go plugin bundles libonnxruntime (~18 MB) per platform. Rows that show only a download size are not yet wired for in-process inference: GLiNER's zero-shot input differs, and the formality ranker ships PyTorch weights that need an ONNX export. “Peak memory” is the resident set the loaded session adds.
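The three measured columns (load, inference, peak memory) can be reproduced with a small harness along these lines. This is a minimal sketch, not the script behind the table: `benchmark` and `rss_mb` are hypothetical names, the `load` and `infer` callables stand in for creating an ONNX session and running one sentence through it, and the `resource` module limits it to Unix-like systems.

```python
import resource
import statistics
import sys
import time


def rss_mb() -> float:
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in KiB on Linux.
    return peak / 1e6 if sys.platform == "darwin" else peak * 1024 / 1e6


def benchmark(load, infer, sentences, runs=20):
    """Time one load, then average per-sentence inference over repeated runs."""
    before = rss_mb()

    start = time.perf_counter()
    session = load()
    load_ms = (time.perf_counter() - start) * 1000

    infer(session, sentences[0])  # warm-up call, excluded from the mean

    timings = []
    for _ in range(runs):
        for sentence in sentences:
            start = time.perf_counter()
            infer(session, sentence)
            timings.append((time.perf_counter() - start) * 1000)

    return {
        "load_ms": load_ms,
        "inference_ms": statistics.mean(timings),
        # Growth of the process high-water mark since before the load.
        "added_peak_mb": max(0.0, rss_mb() - before),
    }


# Stand-in "model" so the sketch runs without any model file.
result = benchmark(
    load=lambda: {"weights": [0.0] * 100_000},
    infer=lambda session, text: sum(session["weights"][: len(text)]),
    sentences=["Hello world.", "Bonjour le monde.", "Hallo Welt."],
)
print(result)
```

With onnxruntime, `load` would presumably wrap `onnxruntime.InferenceSession(path, providers=["CPUExecutionProvider"])` and `infer` would tokenize the sentence and call `session.run`. Because `ru_maxrss` is a process-wide high-water mark, the memory figure is only meaningful when each model is measured in a fresh process.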

What the numbers say

Per-sentence inference is not the cost. A loaded embedding model scores a sentence in single-digit milliseconds — fast enough to run on every block in a pipeline.
The full-precision footprint is the cost. The fp32 export of a 118M-parameter embedding model is a ~465 MB download and over a gigabyte of resident memory — too heavy to load casually inside a CLI that runs next to your editor and your build.
Quantization changes the verdict. The int8 export of the same model is a ~129 MB download and ~40 MB resident, and inference is faster as well (2.66 ms against 4.7 ms per sentence). That is small enough to ship as an explicitly installed plugin and cache, which makes a single small-model checker viable to run on your machine.
Some models stay heavy even quantized. The generalist NER model is ~1.1 GB at full precision and still ~350 MB as int8 (348.5 MB in the table): defensible as an opt-in download, but a poor default for a laptop, and a natural fit for a server that hosts it once.
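The download sizes for the embedding model follow roughly from the parameter count times the bytes stored per weight. The sketch below redoes that arithmetic for the 118M-parameter figure quoted above and prints it next to the Download column from the table. It is a back-of-the-envelope estimate that assumes a single fixed width per weight and nothing else in the file.

```python
PARAMS = 118_000_000  # parameter count quoted in the text (rounded)

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Download column from the table, in MB, for the three e5-small variants.
MEASURED_MB = {"fp32": 464.8, "fp16": 240.5, "int8": 129.2}


def estimated_mb(params: int, dtype: str) -> float:
    """Weights-only size estimate in MB (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for dtype, measured in MEASURED_MB.items():
    estimate = estimated_mb(PARAMS, dtype)
    print(f"{dtype}: estimated {estimate:.0f} MB, measured {measured} MB")

# fp32: estimated 472 MB, measured 464.8 MB
# fp16: estimated 236 MB, measured 240.5 MB
# int8: estimated 118 MB, measured 129.2 MB
```

The estimates land within about ten percent of the measured files. The gaps are expected, since the parameter count is rounded and a real export does not consist purely of uniformly sized weights.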

Standalone, or server-side?

The deterministic checks — terminology, do-not-translate by string, placeholder and tag integrity, register by lexicon — have no model and no download; they always run locally and free. The question is only where the model-backed checks run. Three options:

Option A — small model local, heavy models server-side (recommended)

Ship the int8 embedding model as an optional plugin the user explicitly installs (~129 MB, ~40 MB resident) for voice/style similarity and register; run the generalist NER and any LLM-deep check server-side, where the model is hosted once and amortized across a team and across large batches. Keeps the CLI lean and offline-capable for the common case, without asking every user to download a gigabyte.

Option B — all model-backed checks server-side

kapi stays purely deterministic offline; every ML-backed check is a call to a server you run. Simplest CLI and smallest install, at the cost of the offline subjective checks and a network dependency for them.

Option C — all models local (quantized)

Ship int8 exports of every checker (embedding ~129 MB + NER ~350 MB + register). Maximal offline capability and no server needed, at the cost of a few hundred MB of one-time downloads and a heavier resident footprint when several run together.

The data points to Option A: int8 makes one small model cheap enough to live in the CLI, while the heavy generalist model earns its keep server-side — which is also where batch volume (tens of thousands of strings across many languages) is most economical to process.

How the model is acquired

A checker that needs a model should acquire it explicitly, never by a surprise download in the middle of a kapi check. Consumer ML tools (Hugging Face transformers, Whisper) lazy-download on first use, which is convenient but stalls the first run and fails in airgapped or CI environments. Developer tools make it explicit and pinnable (vale sync, spacy download, ollama pull), and kapi already follows that model for its native deps (kapi plugin install okapi-bridge). The model-backed checker is the same: an opt-in plugin you install (its model bundled in the release tarball, the way the segmenter bundles the ONNX runtime, or pulled by an explicit step), so the download is a deliberate, cacheable, offline-after-install action with a known version.

When the plugin or its model is absent, kapi check still runs every deterministic check and reports the model-backed check as unavailable with the one command that enables it — fail-closed with guidance, not a silent network call. In CI, the install is a setup step (as connector and plugin installs already are), so runs stay deterministic and offline once the cache is warm.
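The fail-closed behavior described above can be sketched as follows. This illustrates the control flow only and is not kapi's implementation: `CheckResult`, `run_checks`, the check names, and the model path are all hypothetical, and the hint string is whatever command the installation documents for enabling the check.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckResult:
    name: str
    status: str  # "ran" or "unavailable"
    hint: str = ""


def run_checks(model_path: Path, enable_hint: str) -> list[CheckResult]:
    """Run deterministic checks always; run the model check only if the model is on disk."""
    # Deterministic checks need no model, so they run unconditionally.
    results = [
        CheckResult("terminology", "ran"),
        CheckResult("placeholders", "ran"),
    ]

    if model_path.is_file():
        results.append(CheckResult("voice", "ran"))
    else:
        # Fail closed: no download is attempted here, only guidance is returned.
        results.append(
            CheckResult("voice", "unavailable", hint=f"enable with: {enable_hint}")
        )
    return results


# Hypothetical cache location, used only for this demonstration.
for result in run_checks(Path("/nonexistent/cache/model.onnx"), "kapi-check pull"):
    print(result.name, result.status, result.hint)
```

The point of the shape is that the missing-model branch contains no network code at all, so a cold cache can never turn a check run into a download.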

This is realized today as the kapi-check plugin (kapi plugins install check, then kapi-check pull downloads the int8 model) and kapi check --voice, which scores each block against a brand profile's examples and reports an advisory finding below the --voice-mincosine cutoff. Because multilingual embedding cosines cluster high, that cutoff is calibrated per profile rather than shipped as a universal number — the honest stance for a proxy.
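To make the per-profile cutoff concrete, here is a minimal sketch of scoring a block against a profile's example embeddings and deriving a cutoff from the profile itself. The aggregation (mean cosine) and the calibration rule (leave-one-out minimum minus a margin) are assumptions chosen for illustration, not kapi's documented method. All function names are hypothetical, and the toy 3-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def voice_score(block_vec, example_vecs):
    """Assumed aggregation: mean cosine against the profile's examples."""
    return sum(cosine(block_vec, e) for e in example_vecs) / len(example_vecs)


def calibrate_cutoff(example_vecs, margin=0.02):
    """Hypothetical calibration: score each example against the others
    (leave-one-out) and place the cutoff just below the lowest score.
    Needs at least two examples."""
    scores = [
        voice_score(vec, example_vecs[:i] + example_vecs[i + 1:])
        for i, vec in enumerate(example_vecs)
    ]
    return min(scores) - margin


def advisory(block_vec, example_vecs, cutoff):
    """Return an advisory message when the score falls below the cutoff, else None."""
    score = voice_score(block_vec, example_vecs)
    if score >= cutoff:
        return None
    return f"voice similarity {score:.3f} below cutoff {cutoff:.3f}"


# Toy vectors: three profile examples, one on-brand block, one off-brand block.
examples = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]
cutoff = calibrate_cutoff(examples)

print(advisory([0.95, 0.1, 0.05], examples, cutoff))  # on-brand: no finding
print(advisory([0.1, 1.0, 0.2], examples, cutoff))    # off-brand: advisory text
```

Deriving the cutoff from the profile's own examples is one way to cope with cosines that cluster high: the threshold tracks how tightly that particular profile's examples agree with each other, so no fixed global value is needed.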