AD-029: Vision and Image Localization
Summary
An image is a localizable asset, not merely a carrier of text. The image
format reads PNG/JPEG and always emits the picture as a model.Media part — the
unit a localization flow can replace wholesale with a per-locale variant.
On top of that base, the out-of-core kapi-vision plugin adds optional
document-vision enrichment:
- OCR — RapidOCR / PP-OCRv5 text detection + recognition (shipped v0.1.0).
- Layout — PP-DocLayoutV3 (RT-DETR) region detection + reading order, yielding tier-3 structure (shipped v0.2.0).
Vision mirrors the kapi-sat and kapi-pdfium plugins: a cgo -tags onnx
binary that loads onnxruntime at runtime, isolated from the portable kapi
binary, driven over a binary-framed stdin/stdout protocol. Like the PDF reader
(AD-028), it is path-based — the host passes a
file path, never image bytes, so the picture lives only in the plugin process.
OCR and layout are opt-in capabilities (ocr, layout config toggles);
with both off, an image is a Media asset only.
Context
Localizing a document that contains images is not one problem but several, and treating "image" as "OCR" conflates them. The distinct modes are:
| Mode | What it localizes | Mechanism |
|---|---|---|
| Whole-image replacement | the pixels | a localized image file per locale swaps the source (screenshots, graphics with baked-in text); pseudo-localization (a visible watermark variant) ships today |
| Alt-text / caption | accessible text, not pixels | the alt text is emitted as a translatable caption Block linked to the image (RoleCaption + RelCaptionOf) and localized through the normal block path |
| Metadata | embedded title/description/keywords | translatable metadata fields → metadata-plane Blocks; non-translatable fields → namespaced Layer properties (core/docmeta) |
| In-image text (OCR) | text rendered into the image | extract → translate → (optionally) re-render |
| Layout / structure | the document's regions | detect regions + reading order, with table regions reconstructed into row/column cell structure, for faithful reconstruction |
Whole-image replacement is the most common and the simplest to reason about: the
translator (or an automated pipeline) supplies a localized picture. The others
are enrichment. The content model already carries what these need —
model.Media{Data│BlobKey│URI, AltText} + PartMedia, and the structure/role
annotations (AD-002) — so the architecture's job is to
keep "image" generic and make vision an optional layer, not the identity of the
format.
The ML capabilities carry the same native-stack weight as the SaT segmenter
(AD-021) — onnxruntime, large model assets — so
they live in a plugin, never in kapi.
Decision
The image format is a localizable asset
core/formats/image reads PNG/JPEG and always emits the image as a Media
part referenced by URI (never inline bytes — the binary never travels through the
kapi part stream). This alone supports whole-image localization: the Media is the
asset; a localized variant is a different file. A matching image writer emits
a Media part's bytes — the whole-image localization sink — so a transform
that produces a localized image variant can be written back out.
Alt-text / caption
An image's accessible text is localized as content, not as a Media field. When an
<image>.alt.txt sidecar sits beside the source, the reader attaches its text to
the Media (AltText, for display) and emits it as a translatable caption
Block linked to the image (RoleCaption + a caption-of relation to the Media
ID). That block flows through the ordinary block path — TM, AI translate, brand
voice, sessions, batching — with no special tool support, and gets per-locale
Targets like any other block. The image writer folds the localized target (or
the source text, as a round-trip fallback) back into a per-locale
<output>.alt.txt sidecar beside the written image. Modeling alt-text as a
linked block (rather than mutating the single Media.AltText field in place)
keeps it per-locale and reuses the whole translation stack; Media.AltText
remains the source value for display.
Pseudo-localization
The first localized-image transform is pseudo-localization — the visual
analog of text pseudo-translation. The pseudo-translate tool, on encountering
an image Media part, replaces it with a clearly-visible watermarked variant (a
color wash + a solid border + a diagonal band; core/imageops.PseudoLocalize)
and pseudo-translates the alt-text. Read an image → pseudo-translate → write, and
the output is an unmistakably-marked image — proof, in a UI or build artifact,
that image localization actually swapped the asset. It is deterministic and
dependency-free (standard-library raster ops only).
Metadata
Embedded document metadata is localized the same way, via the shared
core/docmeta helper. Metadata is document-level — not anchored to any run —
so it lives on the Layer, never in a run-anchored overlay: translatable
fields (title, description, keywords) become Blocks on the metadata plane
(StructureAnnotation.Layer == LayerMetadata) that localize through the normal
block path, while non-translatable fields (author, copyright, software, dates)
are recorded as namespaced Layer.Properties (png:author, xmp:dc:creator,
…) — never translated, kept for inspection. This mirrors the OOXML reader's
treatment of docProps/core.xml (translatable Dublin-Core fields become blocks;
the rest stays skeleton), generalized to formats whose round-trip is a byte copy.
The image reader reads PNG text chunks (tEXt/iTXt/zTXt) and embedded XMP
(PNG and JPEG dc:title/dc:description/dc:subject/dc:creator) without
loading the pixel data — it stops scanning at the first image-data chunk. The
same core/docmeta path carries the PDF Info dictionary
(AD-028).
Scope: extraction surfaces metadata for translation, TM, and inspection. Whether the localized metadata is re-embedded depends on the writer — a skeleton-based format (OOXML) re-applies the translated field, and a cross-format conversion (PDF → Markdown/HTML) carries the metadata blocks into the output document. The byte-copy image writer preserves the source image's original embedded metadata unchanged; re-encoding localized PNG text chunks / XMP back into the raster, like binary EXIF/IPTC parsing, is a documented follow-up.
Two config toggles gate the enrichment, both default-on:
ocr— run in-image text recognition (requires the plugin). Off → Media only.layout— run ML layout when OCR runs; off → geometric structure (tier 2).
kapi-vision — out-of-core, path-based
The plugin is its own Go module (plugins/vision), isolated so its cgo +
onnxruntime stack never enters another build graph. Its engine has two builds,
like kapi-sat: a default pure-Go stub (so the module and the protocol/algorithm
tests build with no native dependency) and the real -tags onnx engine. The
host-side vision engine (cli/vision_plugin.go) discovers and spawns the
plugin and drives it over visionproto (a length-prefixed binary frame protocol,
not line-JSON — image references and structured results), mirroring the wire
structs rather than importing the plugin module.
core/vision is the framework seam: an Engine (OCR) interface + an optional
LayoutEngine interface (type-asserted, so OCR-only backends need not implement
it) + a name-keyed registry, exactly like core/segment. Both methods are
path-based.
OCR — PP-OCRv5
The OCR engine runs the PP-OCRv5 mobile detection (DBNet) and recognition
(CRNN+CTC) models: it builds an MCID-free pipeline — binarize the detection
probability map, extract connected-component boxes, "unclip" them, recognize each
crop and CTC-decode against the PP-OCRv5 dictionary. Recognized lines carry
top-left pixel geometry; the image reader feeds them to the geometric tier-2
(core/structure.Analyze) when layout is unavailable.
Layout — PP-DocLayoutV3 (tier 3)
The layout engine runs PP-DocLayoutV3, an RT-DETR detector. RT-DETR is NMS-free:
given PaddleDetection's image / scale_factor / im_shape inputs it returns
already-decoded detections in original pixel coordinates. Its 25 region classes
map to content roles (doc_title→title, paragraph_title→heading, table→table,
figure/chart/image→picture, formulas, footnotes, headers/footers, …). A
deterministic column-clustering heuristic assigns reading order. The image reader
then assigns OCR lines to layout regions by containment and emits role-tagged
blocks in reading order — tier-3 structure — with the geometric tier-2 as
fallback. A table region's lines are reconstructed into row/column cell
structure (table → table-row → table-cell/table-header) by reusing the
tier-2 grid clustering (structure.Gridify), so both tiers emit tables
identically (structure.TableToParts) and writers render a real table.
Model distribution
OCR's models are small (~21 MB) and bundled in the release tarball beside the
binary (resolved with no configuration), with onnxruntime 1.25.0 (matching
yalue/onnxruntime_go's C API). The layout model is large (~132 MB) and
download-on-demand to the XDG cache on first use. Model resolution searches an
override dir, the bundled dir beside the binary, then the cache; downloads of
on-demand models go to the writable cache. All models are Apache-2.0, mirrored on
a neokapi release asset with pinned hashes. The plugin is not a kapi-cli
dependency — vision is opt-in (kapi plugins install vision).
Consequences
- "Image" stays a generic, localizable format; OCR and layout are optional layers that degrade gracefully (absent plugin, or toggled off) to whole-image Media.
- The portable
kapibinary stays pure-Go and small; the onnxruntime stack is confined to the plugin, and image bytes never enter the host. - Tier-3 structure (authoritative roles + reading order) is available for images, and — once a page rasterizer is wired — for the PDF tier-3 slot in AD-028, since the vision engine is format-agnostic over rasters.
- Whole-image replacement is supported end-to-end: the image is emitted as a
localizable Media, a writer emits localized bytes, pseudo-localization produces
a visible variant, and the target-asset model pairs a source image with its
per-locale files.
project.ResolveAssetVariantsresolves each locale's target path (via the recipe'starget:template) and reports which variants exist — the local counterpart of Bowrain's server-side asset-variant model (AD-007). Because kapi cannot regenerate a real image localization, a localized variant already on disk is authoritative:kapi run/kapi mergekeep it rather than clobber it by reprocessing the source (project.IsBinaryAssetFormatgates this for binary-asset formats), while a missing variant falls through to the flow to produce a pseudo/copy fallback.
Related
- AD-002: Content Model — Media, standoff structure/role annotations, the structure stream vision produces
- AD-021: SaT Segmenter Plugin — the precedent for an isolated onnxruntime plugin
- AD-028: PDF Reader and Structure Tiers — the tier-1/2/3 structure model; vision is its tier-3 engine
plugins/vision/— plugin module, engine, protocol, and README