ML model benchmark

kapi's subjective checks (voice/style similarity, register, do-not-translate by entity) are served by small, open, multilingual models run in-process through the same ONNX runtime the segmenter uses. The cost a user pays is not the per-sentence inference — that is cheap — but the model download and the resident memory. This page measures both, so the choice between running a check on your machine and running it server-side is grounded in numbers.

Measured cost per model

Model / variant | What it checks | Download | Load | Inference | Peak memory | License
--- | --- | --- | --- | --- | --- | ---
e5-small | Voice / style similarity (sentence embeddings) | 464.8 MB | 248.1 ms | 4.7 ms | 1183.3 MB | MIT
e5-small-O4 | Voice / style similarity, fp16-optimized (O4) variant | 240.5 MB | 183.2 ms | 4.68 ms | 520 MB | MIT
e5-small-int8 | Voice / style similarity, int8-quantized variant | 129.2 MB | 83.1 ms | 2.66 ms | 39.5 MB | MIT
gliner-multi | Do-not-translate / entity spotting (zero-shot NER) | 1119.1 MB | - | - | - | Apache-2.0
gliner-multi-int8 | GLiNER, int8-quantized variant (download/footprint mitigation) | 348.5 MB | - | - | - | Apache-2.0
formality (no ONNX yet) | Register / formality classification | 2.8 MB | - | - | - | academic (s-nlp)
sat-3l-sm | Reference: the segmenter kapi already ships (kapi-sat) | 408.5 MB | - | - | - | MIT

Environment: Darwin arm64 · Python 3.11.6 · onnxruntime 1.26.0, running on CPU. Inference is the mean over repeated runs on short multilingual sentences. “Peak memory” is the resident set the loaded session adds. The Go plugin bundles libonnxruntime (~18 MB) per platform. Size-only rows are not yet wired for in-process inference: GLiNER's zero-shot input differs, and the formality ranker ships PyTorch weights that still need an ONNX export.
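The measurement described above can be sketched as a small harness. This is a minimal illustration, not the benchmark's actual script: `benchmark`, its parameters, and the returned keys are hypothetical names, and the memory figure here is a simple before/after difference rather than a true peak.

```python
import statistics
import time
from typing import Callable, Optional, Sequence


def benchmark(load: Callable[[], object],
              infer: Callable[[object, str], object],
              sentences: Sequence[str],
              repeats: int = 20,
              rss_bytes: Optional[Callable[[], int]] = None) -> dict:
    """Time one load, then average per-sentence inference over repeated runs."""
    rss_before = rss_bytes() if rss_bytes else None

    start = time.perf_counter()
    session = load()
    load_ms = (time.perf_counter() - start) * 1000.0

    # Untimed warm-up pass so first-call initialisation is not counted.
    for sentence in sentences:
        infer(session, sentence)

    samples = []
    for _ in range(repeats):
        for sentence in sentences:
            start = time.perf_counter()
            infer(session, sentence)
            samples.append((time.perf_counter() - start) * 1000.0)

    added_mb = None
    if rss_bytes:
        # Resident memory added since before the load (assumption: a
        # before/after delta stands in for the peak figure in the table).
        added_mb = (rss_bytes() - rss_before) / 1_000_000

    return {
        "load_ms": load_ms,
        "inference_ms": statistics.mean(samples),
        "added_mb": added_mb,
        "runs": len(samples),
    }
```

With onnxruntime, `load` could wrap `onnxruntime.InferenceSession(path, providers=["CPUExecutionProvider"])`, and `rss_bytes` could read the process resident set size, for example through psutil if it is installed. Both are left as callables so the harness itself has no dependencies.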

What the numbers say

  • Per-sentence inference is not the cost. A loaded embedding model scores a sentence in single-digit milliseconds — fast enough to run on every block in a pipeline.
  • The full-precision footprint is the cost. The fp32 export of a 118M-parameter embedding model is a ~465 MB download and over a gigabyte of resident memory — too heavy to load casually inside a CLI that runs next to your editor and your build.
  • Quantization changes the verdict. The int8 export of the same model is a ~129 MB download and ~40 MB resident — and slightly faster. That is small enough to ship as an explicitly-installed plugin and cache, which makes a single small-model checker viable to run on your machine.
  • Some models stay heavy even quantized. The generalist NER model is ~1.1 GB at full precision and still ~350 MB as int8 (348.5 MB measured): defensible as an opt-in download, but a poor default for a laptop, and a natural fit for a server that hosts it once.
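As a rough cross-check of the download figures, raw weight storage can be estimated as parameter count times bytes per parameter. The sketch below assumes 1 MB = 10^6 bytes and counts weights only. Measured files can differ because an export also carries graph metadata, and a quantized export may keep some tensors at higher precision. The function names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Measured download sizes (MB) of the three e5-small variants, from the table.
MEASURED_MB = {"fp32": 464.8, "fp16": 240.5, "int8": 129.2}


def estimated_weights_mb(params: int, precision: str) -> float:
    """Raw weight storage only: parameter count times bytes per parameter."""
    return params * BYTES_PER_PARAM[precision] / 1_000_000


def ratio(small: float, large: float) -> float:
    """Size of the smaller artifact as a fraction of the larger one."""
    return round(small / large, 3)


for precision, measured in MEASURED_MB.items():
    estimate = estimated_weights_mb(118_000_000, precision)
    print(f"{precision}: estimated {estimate:.0f} MB, measured {measured} MB")

print("int8 / fp32 download:", ratio(129.2, 464.8))  # 0.278
print("int8 / fp32 resident:", ratio(39.5, 1183.3))  # 0.033
```

The int8 download is a little over a quarter of the fp32 one, as the byte widths suggest, while the resident-memory drop is much steeper, so the memory saving cannot be read off the file sizes alone.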

Standalone, or server-side?

The deterministic checks — terminology, do-not-translate by string, placeholder and tag integrity, register by lexicon — have no model and no download; they always run locally and free. The question is only where the model-backed checks run. Three options:

Option A — small model local, heavy models server-side (recommended)

Ship the int8 embedding model as an optional plugin the user explicitly installs (~129 MB, ~40 MB resident) for voice/style similarity and register; run the generalist NER and any LLM-deep check server-side, where the model is hosted once and amortized across a team and across large batches. Keeps the CLI lean and offline-capable for the common case, without asking every user to download a gigabyte.

Option B — all model-backed checks server-side

kapi stays purely deterministic offline; every ML-backed check is a call to a server you run. This gives the simplest CLI and the smallest install, but it gives up offline subjective checks and adds a network dependency for them.

Option C — all models local (quantized)

Ship int8 exports of every checker (embedding ~129 MB + NER ~350 MB + register). Maximal offline capability and no server needed, at the cost of roughly half a gigabyte of one-time downloads and a heavier resident footprint when several models run together.

The data points to Option A: int8 makes one small model cheap enough to live in the CLI, while the heavy generalist model earns its keep server-side — which is also where batch volume (tens of thousands of strings across many languages) is most economical to process.
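The reasoning behind Option A can be written down as a budget rule. The sketch below is only an illustration: the `ModelCost` type, the `placement` function, and both thresholds (150 MB download, 100 MB resident) are hypothetical and are not values kapi defines. A model with no memory measurement is conservatively sent server-side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelCost:
    name: str
    download_mb: float
    resident_mb: Optional[float]  # None where the table has no measurement


def placement(cost: ModelCost,
              max_download_mb: float = 150.0,   # hypothetical budget
              max_resident_mb: float = 100.0) -> str:  # hypothetical budget
    """Return "local" only when both measured costs fit the budgets."""
    if cost.resident_mb is None:
        # Unmeasured footprint: do not assume it is cheap.
        return "server"
    if cost.download_mb <= max_download_mb and cost.resident_mb <= max_resident_mb:
        return "local"
    return "server"


# Figures copied from the measured-cost table.
models = [
    ModelCost("e5-small", 464.8, 1183.3),
    ModelCost("e5-small-int8", 129.2, 39.5),
    ModelCost("gliner-multi-int8", 348.5, None),
]

for model in models:
    print(model.name, "->", placement(model))
```

Under these assumed budgets only e5-small-int8 lands locally, which matches the split Option A proposes. Different budgets would move the line, which is why they are parameters rather than constants.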

How the model is acquired

A checker that needs a model should acquire it explicitly, never by a surprise download in the middle of a kapi check. Consumer ML tools (Hugging Face transformers, Whisper) lazy-download on first use, which is convenient but stalls the first run and fails in airgapped or CI environments. Developer tools make the download explicit and pinnable (vale sync, spacy download, ollama pull), and kapi already follows that pattern for its native deps (kapi plugin install okapi-bridge). The model-backed checker works the same way: it is an opt-in plugin you install, with its model either bundled in the release tarball (the way the segmenter bundles the ONNX runtime) or pulled by an explicit step, so the download is a deliberate, cacheable, offline-after-install action with a known version.

When the plugin or its model is absent, kapi check still runs every deterministic check and reports the model-backed check as unavailable with the one command that enables it — fail-closed with guidance, not a silent network call. In CI, the install is a setup step (as connector and plugin installs already are), so runs stay deterministic and offline once the cache is warm.
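The fail-closed behaviour can be sketched as a status check that looks only at the local disk. This is an assumption-laden illustration, not kapi's implementation: the function name, the returned dictionary shape, and the model path passed in are hypothetical, and the hint text simply repeats the install commands this page names.

```python
from pathlib import Path

# Guidance shown when the model is missing (commands as named on this page).
ENABLE_HINT = "run `kapi plugins install check`, then `kapi-check pull`"


def model_backed_status(model_file: Path) -> dict:
    """Report availability from the local cache only; never touch the network."""
    if model_file.is_file():
        return {"check": "voice", "status": "available"}
    # Fail closed: the check is skipped and the user is told how to enable it.
    return {"check": "voice", "status": "unavailable", "hint": ENABLE_HINT}
```

The deterministic checks would run regardless of this result; only the model-backed finding is replaced by the "unavailable" entry and its hint.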

This is realized today as the kapi-check plugin (kapi plugins install check, then kapi-check pull downloads the int8 model) and kapi check --voice, which scores each block against a brand profile's examples and reports an advisory finding below the --voice-mincosine cutoff. Because multilingual embedding cosines cluster high, that cutoff is calibrated per profile rather than shipped as a universal number — the honest stance for a proxy.
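One way to picture per-profile calibration is a leave-one-out pass over the profile's own examples. The sketch below is a guess at the general shape, not the kapi-check algorithm: scoring by best match, the leave-one-out rule, and the `margin` value are all assumptions, and the vectors stand in for sentence embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score(block: Vector, examples: List[Vector]) -> float:
    """Best cosine match between a block and the profile examples (assumed rule)."""
    return max(cosine(block, example) for example in examples)


def calibrate_cutoff(examples: List[Vector], margin: float = 0.02) -> float:
    """Score each example against the others; cutoff = lowest score minus margin."""
    if len(examples) < 2:
        raise ValueError("need at least two profile examples to calibrate")
    held_out = [
        score(example, examples[:i] + examples[i + 1:])
        for i, example in enumerate(examples)
    ]
    return min(held_out) - margin


def is_flagged(block: Vector, examples: List[Vector], cutoff: float) -> bool:
    """Advisory finding when the block scores below the profile's cutoff."""
    return score(block, examples) < cutoff
```

Because the cutoff is derived from how tightly a profile's own examples cluster, a profile with very similar examples gets a high cutoff and a looser profile gets a lower one, which is the point of not shipping a single universal number.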