Video Lab

Drop in a video and the Lab runs the whole multimodal pipeline in your browser. ffmpeg.wasm demuxes the file into an audio track and sampled frames, Whisper transcribes the speech into a subtitle track, and PP-OCRv5 reads the on-screen text in each sampled frame, which is overlaid at that frame's timecode. These are the same engines the native kapi-av / kapi-asr / kapi-vision plugins run; only the runtime differs. The ffmpeg core (~32 MB) and the Whisper model (~40 MB) load on first use. Nothing is mocked. For an instant, no-download tour, see the Multimodal Showcase.
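To make the last two stages concrete, here is a minimal sketch of how timed transcript segments could become a WebVTT subtitle track, and how OCR text could be looked up for a given playback time. It is not the Lab's actual code: the `Segment` and `FrameText` shapes and the function names `vttTime`, `toWebVtt`, and `frameTextAt` are hypothetical, and the sketch assumes times are in seconds and frames are sorted by time.

```typescript
// Hypothetical data shapes; the Lab's real structures are not documented here.
interface Segment { start: number; end: number; text: string } // transcript, seconds
interface FrameText { time: number; text: string }             // OCR text of one sampled frame

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(ms % 1000, 3);
}

// Serialize transcript segments as a WebVTT file body, one cue per segment.
function toWebVtt(segments: Segment[]): string {
  const cues: string[] = [];
  for (const seg of segments) {
    cues.push(vttTime(seg.start) + " --> " + vttTime(seg.end) + "\n" + seg.text);
  }
  return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
}

// Return the OCR text of the latest sampled frame at or before time t,
// or null if t precedes the first frame. Assumes frames are sorted by time.
function frameTextAt(frames: FrameText[], t: number): string | null {
  let current: string | null = null;
  for (const frame of frames) {
    if (frame.time > t) break;
    current = frame.text;
  }
  return current;
}
```

In a browser, the string returned by `toWebVtt` could be wrapped in a `Blob` and attached to the video through a `<track>` element, while `frameTextAt` could be called from a `timeupdate` handler to refresh the overlay.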
