I finally got around to testing the “live video ingestion” in Qwen 2.5 VL, which Alibaba has been pretty loud about in their press and demos. If you’ve been watching the AI model scene, you know everyone claims truly novel video capabilities. I wanted to see if someone had actually cracked native video understanding at the API level.
Here’s what actually happened.
What I Expected
Based on the promo material, I went in thinking Qwen could directly ingest video and do something beyond the usual frame sampling. To me, “live” should mean the model or its tokenizer is looking at the video file as a first-class input, not snapshots or base64 blobs labeled as “frames.” I figured maybe there was real temporal reasoning or sequence awareness happening under the hood.
How It Works In Practice
Short version: it doesn’t work that way.
The Qwen API requires access to the full video file, either locally or via a remote URL. You don’t get to stream anything or pass video data directly to the API. The demo sidesteps this by downloading and saving the video locally, then running it through what they call their “tokenizer.”
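To make the constraint concrete, here's a minimal sketch of that flow, based on the publicly posted Qwen2.5 VL examples: download the whole file first, then hand a local path to the message builder. The helper and message schema follow Alibaba's published snippets (`qwen_vl_utils.process_vision_info`, a `{"type": "video"}` content block), but treat the details as illustrative; the URL and filenames here are placeholders.

```python
import requests
from qwen_vl_utils import process_vision_info  # helper shipped alongside the Qwen VL demos

# Step 1: the whole video has to exist on disk before anything else happens.
video_url = "https://example.com/clip.mp4"  # hypothetical URL
local_path = "clip.mp4"
with open(local_path, "wb") as f:
    f.write(requests.get(video_url, timeout=60).content)

# Step 2: build a chat message pointing at the saved file; there is no streaming option.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": f"file://{local_path}"},
        {"type": "text", "text": "What happens in this clip?"},
    ],
}]

# Step 3: the "tokenizer" step, which under the hood samples frames from the saved file.
image_inputs, video_inputs = process_vision_info(messages)
```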
Out of curiosity, I tore into the tokenizer code. It’s a basic frame extractor with resizing. No real video understanding beyond what you’d get piping stills into an image model. There’s no attempt at capturing sequence, context, or anything meaningfully temporal.
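This isn't their code, but functionally it boils down to something like the sketch below: decode the file, keep every Nth frame, resize it, and return a flat list of stills. The sampling interval and target size are arbitrary placeholders, and I'm using OpenCV here purely as a stand-in decoder.

```python
import cv2  # stand-in decoder; the point is the logic, not the library


def extract_frames(path: str, every_n: int = 30, size: tuple = (448, 448)) -> list:
    """Sample every Nth frame from a video file and resize it. That's the whole trick."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    # A list of stills: no timestamps, no ordering signal beyond list position, no audio.
    return frames
```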
No Audio, Either
One thing that really stood out: they completely ignore audio. There isn't even a stub for syncing sound to frames or letting the model consider what's being said in a clip. That's a huge gap if you want real video understanding. People rarely communicate with images alone: narration, dialog, music, and background cues all matter for knowing what's happening in a scene.
Something We Do Differently in Pulse
I won’t get into the weeds on the backend, but this is the main reason Pulse’s video ingestion pulls ahead. For every frame batch, the corresponding audio segment is included before the model is asked to reason about context. The result is a lot closer to “native video understanding” than just running image OCR on freeze frames. You end up with output that actually incorporates what’s said, heard, and shown.
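Without giving away the backend, the general shape of the pattern is easy to describe: chunk the timeline, and for each chunk hand the model the sampled frames together with the audio (or a transcript of it) covering the same span. The sketch below is not Pulse's actual code; `transcribe_segment` is a hypothetical stand-in for whatever produces the audio text for a time range.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    start_s: float    # chunk start time in seconds
    end_s: float      # chunk end time in seconds
    frames: list      # sampled frames covering [start_s, end_s]
    transcript: str   # what was said during the same span


def build_chunks(frames_with_ts, transcribe_segment, window_s=10.0):
    """Group timestamped frames into fixed windows and pair each window with its audio.

    frames_with_ts: iterable of (timestamp_seconds, frame) pairs.
    transcribe_segment: hypothetical callable (start_s, end_s) -> transcript string.
    """
    buckets = defaultdict(list)
    for ts, frame in frames_with_ts:
        buckets[int(ts // window_s)].append(frame)

    chunks = []
    for idx in sorted(buckets):
        start, end = idx * window_s, (idx + 1) * window_s
        chunks.append(Chunk(start, end, buckets[idx], transcribe_segment(start, end)))
    return chunks
```

Each chunk then goes to the model as one prompt, frames plus the matching transcript, so the answer reflects both what's shown and what's said over that stretch of the clip.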
Takeaways
If you were hoping Qwen 2.5 VL would finally deliver on the promise of real-time, native, end-to-end video understanding… it isn't there yet. Instead, you get a basic frame grabber with no temporal context and no audio support, and the API itself is clunky about ingestion: it needs the complete file, local or at a URL, before it will do anything.
If you want video analysis that pays attention to both what’s happening on-screen and what’s being said, Pulse is still one of the only ways I know to get that directly out of the box.
If you've tried Qwen (or another model that actually delivers on its video promises), let me know. I'd love to see it working in the wild, but it's not here yet.