AD-030: Multimodal Extraction and LLM Refinement
Summary
Extracting translatable content from a non-text medium — text rendered into
an image, speech in an audio track, captions and on-screen text in video — is one
pattern, not three. In every case a fast local extractor turns raw media into
model.Blocks, each anchored back to a slice of the source and carrying a
per-unit confidence. The hard units are then escalated, low-confidence-first,
to a configurable multimodal LLM that re-reads just that slice — a crop for an
image, a time span for audio, a frame or clip for video.
This rests on two foundations defined elsewhere — it adds no new content-model primitive:
- The content model (AD-002) anchors a Block to its source by a per-medium facet (run range for text, geometry for rendered media, timing for timed media) and records how a recognized source was produced in its Origin (engine + confidence). Extractors populate these; the escalation reads them.
- The provider interface (AD-011) carries image/audio/video content parts and advertises each backend's input modalities.
On those, this AD adds three things:
- The confidence-gated escalation pattern: tier the readers, escalate only the units the local extractor is unsure of.
- One generic media-refine tool, a source-Transform (AD-006), that gates on the source Origin confidence, slices the source via a per-modality MediaSlicer, and rewrites the Block's source from the LLM response, behind faithfulness guards.
- Plugin and pipeline symmetry: kapi-asr and the video demux reader mirror kapi-vision, so audio and video reuse the same tier.
The escalation tier is identical across modalities; only the anchor facet, the
slicer, and the LLM content part differ. Heavy extractors stay out-of-core in
plugins, mirroring kapi-vision and the SaT segmenter
(AD-021).
Context
The vision work (AD-029) establishes an OCR pipeline and an in-browser handwriting cascade — PP-OCRv5 reads every line fast, and lines below a confidence threshold are re-read by a handwriting model (TrOCR). That cascade is the seed of a general idea: a frontier multimodal LLM reads hard handwriting, garbled scans, accented or noisy speech, and ambiguous on-screen text far better than a small specialized model, because it brings a language prior and world knowledge to the disambiguation. But it is slower, costs per call, returns no calibrated confidence, and — the decisive risk for a faithfulness-first tool — fails dishonestly: handed an illegible crop it confabulates a plausible-but-wrong word rather than admitting defeat. So the LLM is never the primary reader; it is a narrow escalation over only the units the fast local extractor is unsure of, fed only the slice in question.
Audio and video are the same problem in different coordinates: speech and on-screen text are anchored in time (and, for video, time plus space) rather than on a still page. Generalizing the spatial anchor to cover time, carrying the extractor's confidence as a first-class attribute that the escalation gate reads, and giving the provider interface image/audio/video content parts are therefore one design, not an OCR feature. Audio extraction (automatic speech recognition, ASR, the audio counterpart of OCR: speech in, text out) and video extraction plug into the same escalation tier without reinvention.
Decision
The pattern: confidence-gated escalation over a media slice
Every modality runs the same three tiers, differing only by adapter:
- Tier 1 — fast local extractor. OCR (image), ASR (audio), or demux→ASR+OCR (video) over the whole input, emitting confidence-scored Blocks.
- Tier 2 — specialized local model (optional). A model tuned for the hard case: TrOCR for handwriting; a larger Whisper or domain ASR for difficult speech. Local, still credential-free.
- Tier 3 — configurable multimodal LLM. The residual low-confidence units only, each re-read from its source slice, with the provider explicitly selected — never an implicit fallback.
What is shared across modalities is everything that governs correctness: the confidence gate, a context hint (neighbouring extracted units passed as text so the LLM gets the language prior without shipping the whole page or track), the provenance tag, and the anti-confabulation guards below. What differs is captured entirely in the per-modality adapters.
| | Image / OCR | Audio / ASR | Video |
|---|---|---|---|
| Tier-1 extractor | PP-OCRv5 (kapi-vision) | Whisper-family (kapi-asr) | demux → ASR on audio track + OCR on frames |
| Anchor | spatial: page + Rect | temporal: [startMs, endMs] | both — time span + optional frame bbox |
| Slice | crop pixels | cut time range | frame-extract (+crop) or short clip |
| Confidence | CTC mean logit | segment avg_logprob / no_speech_prob | per-track |
| LLM content part | image | audio | image (frame) or video (clip) |
| Refusal token | [illegible] | [inaudible] | per-track |
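The gate shared by all three modalities can be sketched in a few lines of Go. This is an illustration under assumptions, not the project's code: the Unit type, the SelectForEscalation name, and the optional call budget are hypothetical, and the confidence scale is taken to be 0 to 1.

```go
package main

import (
	"fmt"
	"sort"
)

// Unit is a hypothetical stand-in for an extracted Block: an ID plus the
// confidence recorded by the local extractor (assumed to lie in [0, 1]).
type Unit struct {
	ID         string
	Confidence float64
}

func (u Unit) String() string { return fmt.Sprintf("%s(%.2f)", u.ID, u.Confidence) }

// SelectForEscalation keeps the units strictly below threshold, orders them
// low-confidence-first, and caps the result at budget. The budget is an
// assumption of this sketch; budget <= 0 means no cap.
func SelectForEscalation(units []Unit, threshold float64, budget int) []Unit {
	var picked []Unit
	for _, u := range units {
		if u.Confidence < threshold {
			picked = append(picked, u)
		}
	}
	sort.SliceStable(picked, func(i, j int) bool {
		return picked[i].Confidence < picked[j].Confidence
	})
	if budget > 0 && len(picked) > budget {
		picked = picked[:budget]
	}
	return picked
}

func main() {
	units := []Unit{{"a", 0.97}, {"b", 0.41}, {"c", 0.78}, {"d", 0.12}}
	for _, u := range SelectForEscalation(units, 0.80, 2) {
		fmt.Println(u) // d(0.12) then b(0.41)
	}
}
```

Ordering low-confidence-first means that, when calls are capped, the units the local extractor trusts least are the ones that reach the LLM.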
What the extractor records
The escalation needs nothing the content model does not already define
(AD-002). Each extractor's block builder
(BlocksFromOCR, BlocksFromASR) populates, on every Block it emits:
- the anchor facet for its medium: a geometry annotation (page + bounding box) for a rendered region, a timing annotation (time span) for an audio or video segment, both for on-screen text in video; and
- the source Origin: the extracting engine (ocr, asr) and a confidence, the same record a translation carries on its target side.
So a Block arrives at the refinement tier self-describing: confidence to gate on, an anchor that says which slice of the source to re-read, and an engine for the audit trail. The tier introduces no overlay of its own.
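As a rough picture of what "self-describing" means, the sketch below models a Block with optional geometry and timing facets plus an Origin, and derives the kind of slice to re-read from whichever facets are present. The struct shapes, field names, and the SliceKind helper are assumptions made for illustration; the real definitions live in the content model (AD-002).

```go
package main

import "fmt"

// Hypothetical, simplified shapes; the real ones are defined by AD-002.
type Geometry struct {
	Page       int
	X, Y, W, H int
}

type Timing struct{ StartMs, EndMs int64 }

type Origin struct {
	Engine     string // e.g. "ocr" or "asr"
	Confidence float64
}

type Block struct {
	Source   string
	Geometry *Geometry // set for rendered regions
	Timing   *Timing   // set for timed segments
	Origin   Origin
}

// SliceKind reports which kind of slice the block's anchor facets call for.
// The returned labels are illustrative names, not identifiers from the source.
func SliceKind(b Block) (string, error) {
	switch {
	case b.Geometry != nil && b.Timing != nil:
		return "frame-crop", nil // on-screen text in video
	case b.Geometry != nil:
		return "crop", nil // rendered region on a page or image
	case b.Timing != nil:
		return "time-range", nil // audio or video segment
	default:
		return "", fmt.Errorf("block %q has no media anchor facet", b.Source)
	}
}

func main() {
	cue := Block{
		Source: "hello",
		Timing: &Timing{StartMs: 1200, EndMs: 2500},
		Origin: Origin{Engine: "asr", Confidence: 0.42},
	}
	kind, err := SliceKind(cue)
	fmt.Println(kind, err)
}
```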
The provider side is equally already-defined: the multimodal aiprovider
(AD-011) carries the slice as an image/audio/video
content part, and media-refine reads InputModalities() to pick a provider that
accepts the slice's modality — erroring clearly rather than silently degrading if
none is configured.
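A minimal sketch of that capability check follows. Only the InputModalities() name comes from the text above; its return type, the Modality constants, the Provider interface shape, and PickProvider are assumptions for illustration.

```go
package main

import "fmt"

// Modality and its constants are hypothetical names for this sketch.
type Modality string

const (
	Image Modality = "image"
	Audio Modality = "audio"
	Video Modality = "video"
)

// Provider is a simplified stand-in for the AD-011 provider interface.
type Provider interface {
	Name() string
	InputModalities() []Modality
}

type staticProvider struct {
	name string
	in   []Modality
}

func (p staticProvider) Name() string                { return p.name }
func (p staticProvider) InputModalities() []Modality { return p.in }

// PickProvider returns the first explicitly configured provider that accepts
// the wanted modality. It returns an error instead of falling back to a
// provider that cannot read the slice.
func PickProvider(configured []Provider, want Modality) (Provider, error) {
	for _, p := range configured {
		for _, m := range p.InputModalities() {
			if m == want {
				return p, nil
			}
		}
	}
	return nil, fmt.Errorf("media-refine: no configured provider accepts %q input", want)
}

func main() {
	configured := []Provider{
		staticProvider{name: "text-only"},
		staticProvider{name: "vision-llm", in: []Modality{Image}},
	}
	if p, err := PickProvider(configured, Image); err == nil {
		fmt.Println("image ->", p.Name())
	}
	if _, err := PickProvider(configured, Audio); err != nil {
		fmt.Println(err)
	}
}
```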
The generic media-refine tool
One tool, dispatched by the anchor/modality, behind a MediaSlicer per modality:
```go
type MediaSlicer interface {
	// Slice opens the source by reference (src is a path/BlobKey/URI, never the
	// whole asset in memory) and returns the bounded slice as a content part
	// whose model.Media is inline when small, a BlobKey/URI when not. It reads
	// the block's anchor facet (AD-002): ImageCropper crops the geometry bbox;
	// AudioCutter cuts the timing span; VideoClipper extracts a frame (+crop) or
	// a short clip.
	Slice(ctx context.Context, src MediaRef, b *model.Block) (aiprovider.ContentPart, error)
}
```
Source assets stay references end to end (model.Media BlobKey > URI > Data,
AD-002; plugins are path-based): only the bounded slice is
ever materialized, and the provider boundary resolves it to bytes
(AD-011) — a whole raster or track never enters the part
stream or a provider call.
Control flow:
- Gate: skip Blocks whose source Origin confidence is at or above the threshold.
- Slice: resolve the modality's MediaSlicer; produce the content part.
- Prompt: [neighbouring extracted text as context] + [media part] + instruction to transcribe only the slice, returning the refusal token when unsure.
- Call: the explicitly-configured provider (capability-checked).
- Rewrite: emit an EditPlan rewriting the Block source; set its source Origin to the llm-refined kind (AD-002), with engine llm:<provider> and the prior recognizer's engine kept in reference so the refinement is queryable without losing the original provenance; and add a qa finding marking the unit for review when the LLM output diverges sharply from the Tier-1/Tier-2 guess or returns the refusal token.
media-refine is a source-Transform (AD-006): it
rewrites source, so it runs in a flow's leading source-transform stage — the
same slot redaction occupies (AD-020) — settling the source
before annotation and translation. It must access the source raster/track while it
still exists; the vision tier-3 reader consumes and deletes the page raster before
blocks reach tools, so media-refine runs inside the extraction boundary (the
slicer holds the source ref), not as an arbitrary downstream tool.
Two output modes: round-trip the text, or replace the asset
Extraction feeds two distinct localization modes — the same split AD-029 draws for images, generalized to timed media. They are independent, and an asset can use both at once.
Round-trip the text. The anchored Blocks localize as text and return to where they came from. The path depends on how the text lives in the asset:
- Text track / sidecar: full round-trip today. WebVTT, SubRip, and TTML are first-class formats with reader and writer (core/formats/{vtt,srt,ttml}, registered), so audio/video cues extract → translate → merge back into a localized track the source platform ingests. For raw audio/video, ASR produces the cues and the timing anchor is the hand-off into the timed-text writer. Registered built-in flows compose this end to end: audio-to-subtitles, video-to-subtitles, and image-ocr-translate. Video extraction emits both speech cues and geometry-anchored frame OCR, so video-to-subtitles runs a subtitle-filter step first: it keeps only timing-anchored, non-geometry Blocks, so on-screen text never leaks into the spoken-subtitle track.
- Embedded text layer with a skeleton. PDF text layers and tagged formats re-apply targets through the format's writer (the skeleton mechanism, AD-029).
- Baked into pixels or waveform. OCR text burned into an image, or speech in an audio track, cannot return to the same rendered asset without re-rendering (out of scope, AD-029) or TTS (not designed). The localization is delivered as a companion instead — e.g. a generated localized subtitle track, itself a VTT/SRT write.
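The subtitle-filter rule described in the first item (keep timing-anchored, non-geometry Blocks) reduces to a simple predicate. The sketch below uses hypothetical Block, Timing, and Geometry types and a hypothetical CueLine helper for display; it is not the registered step's implementation.

```go
package main

import "fmt"

// Hypothetical, simplified facets for this sketch.
type Timing struct{ StartMs, EndMs int64 }
type Geometry struct{ X, Y, W, H int }

type Block struct {
	Text     string
	Timing   *Timing
	Geometry *Geometry
}

// SubtitleFilter keeps only Blocks that carry a timing anchor and no geometry
// anchor, so frame OCR (which has geometry) is dropped from the spoken track.
func SubtitleFilter(blocks []Block) []Block {
	kept := make([]Block, 0, len(blocks))
	for _, b := range blocks {
		if b.Timing != nil && b.Geometry == nil {
			kept = append(kept, b)
		}
	}
	return kept
}

// CueLine renders a kept Block as a plain "start-end text" line for display.
func CueLine(b Block) string {
	return fmt.Sprintf("%d-%d %s", b.Timing.StartMs, b.Timing.EndMs, b.Text)
}

func main() {
	blocks := []Block{
		{Text: "Welcome back.", Timing: &Timing{0, 1800}},
		{Text: "EXIT", Timing: &Timing{500, 900}, Geometry: &Geometry{10, 10, 40, 12}},
		{Text: "Let's begin.", Timing: &Timing{2000, 3200}},
	}
	for _, b := range SubtitleFilter(blocks) {
		fmt.Println(CueLine(b))
	}
}
```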
Replace the asset. Independently, the whole file is a localizable Media
asset: the target-asset variant model (AD-029 — IsBinaryAssetFormat,
ResolveAssetVariants) pairs a source image/audio/video with per-locale files and
treats a localized variant on disk as authoritative. This is medium-agnostic:
IsBinaryAssetFormat covers image, audio, and video, each with a passthrough
writer that emits the supplied per-locale bytes — audio and video swap wholesale
exactly as images do. The engine never synthesizes localized media (no TTS, no
re-encode); a replacement is a file the user or a connector provides.
The modes compose on one asset: a video's subtitle track round-trips as text while the file itself stays replaceable; an image carries localized OCR/alt-text Blocks and is swappable.
Plugin and pipeline symmetry
- kapi-asr mirrors kapi-vision and the SaT plugin exactly: a cgo -tags onnx (whisper.cpp / ONNX) binary loading its native stack at runtime, isolated from the portable kapi binary, driven over a stdin/stdout protocol, and path-based: the host passes a media path, never bytes (AD-021, AD-029).
- Video is a demux format reader that emits an audio child Layer (→ ASR) and a visual child Layer (→ frame OCR), reusing the Layer-nesting model that already handles embedded content (HTML-in-JSON → child Layer, AD-002). It writes no transcription code of its own; it composes kapi-asr + kapi-vision.
- Labs extend the same way: they use the same engines as the plugins, with only the runtime differing (WebAssembly, not native). The in-browser Vision Lab is the image instance of the pattern; the Audio Lab is its direct analog (transformers.js runs a Whisper automatic-speech-recognition pipeline in-browser, as the Vision Lab runs OCR and TrOCR); the Video Lab composes both: ffmpeg.wasm demuxes the clip into an audio track (→ Whisper) and sampled frames (→ PP-OCRv5), the browser instance of the demux-format reader above. A pre-recorded Multimodal Showcase plays the whole story over canned data, so it needs no model download. The browser bridges live in @neokapi/kapi-playground (asrBridge, avBridge, visionBridge).
Faithfulness and guards
The confabulation risk is identical across modalities, so the guards are too:
- Refusal token. The model is instructed to return [illegible] / [inaudible] rather than guess; the token marks the unit for review (a qa finding), not fabricated source.
- Divergence check. A Tier-3 result that disagrees sharply with the Tier-1/2 guess (both low-confidence) is flagged for review rather than silently accepted.
- Provenance is visible. LLM-sourced source text is the least-verified tier; the editor renders the source Origin (engine + confidence) and any qa review finding (AD-027) so a reviewer sees exactly which units a model invented versus read.
- Slice, never page. Only the low-confidence slice plus a text context hint leaves the process, bounding both cost and data exposure.
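The refusal-token and divergence guards could be combined roughly as below. The document does not define how "disagrees sharply" is measured; normalized Levenshtein distance and the 0.5 cut-off used in the example are assumptions of this sketch, as are the Divergence and NeedsReview names.

```go
package main

import (
	"fmt"
	"strings"
)

var refusalTokens = map[string]bool{"[illegible]": true, "[inaudible]": true}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein computes the edit distance between two rune slices using two rows.
func levenshtein(a, b []rune) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// Divergence is the edit distance divided by the longer length, in [0, 1].
func Divergence(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	if longest == 0 {
		return 0
	}
	return float64(levenshtein(ra, rb)) / float64(longest)
}

// NeedsReview flags a refinement when the model refused or when its output
// diverges from the local guess by more than maxDivergence.
func NeedsReview(guess, refined string, maxDivergence float64) (bool, string) {
	refined = strings.TrimSpace(refined)
	if refusalTokens[refined] {
		return true, "model returned refusal token " + refined
	}
	if d := Divergence(strings.TrimSpace(guess), refined); d > maxDivergence {
		return true, fmt.Sprintf("divergence %.2f exceeds %.2f", d, maxDivergence)
	}
	return false, ""
}

func main() {
	fmt.Println(NeedsReview("lnvoice t0tal", "invoice total", 0.5))
	fmt.Println(NeedsReview("xqz", "invoice total", 0.5))
	fmt.Println(NeedsReview("hello", "[inaudible]", 0.5))
}
```

A small correction such as a misread character stays under the cut-off, while a wholesale replacement of the guess is routed to a reviewer.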
Provider credentials
The configurable multimodal LLM tier runs server/CLI-side and draws its
provider + model + key from the same credential path as the other AI tools —
the keychain and environment (AD-011,
AD-013). "Configurable like the other AI tools" is satisfied
there. The in-browser Labs demonstrate the local extraction tiers (OCR, ASR, the
handwriting cascade); credentialed cloud refinement is a CLI/desktop capability,
not a browser one.
Related
- AD-002 Content Model: anchor facets (geometry/timing) and source Origin confidence this tier reads
- AD-006 Tool System: capability views, source-transform stage
- AD-011 AI Providers: the multimodal LLMProvider this tier sends slices to
- AD-020 Content Redaction: the recoverable-Transform precedent
- AD-021 SaT Segmenter Plugin: native-stack plugin isolation template
- AD-027 Visual Editor: renders source provenance and qa review findings
- AD-029 Vision and Image Localization: the image instance of this pattern