Why workflows matter
The most promising use cases for video AI are professional: surgical training review, factory line monitoring, equipment repair guidance, and instructional content synthesis. These tasks share a structural property that is missing from public video benchmarks. They are procedural. They require the model to understand not just what objects are present, but what order steps occur in, which steps depend on which, and whether the sequence was executed correctly.
VidWork-Bench is an attempt to measure procedural video understanding directly. We chose cooking as the pilot domain because it is procedurally rich, well annotated through the YouCook2 dataset, and close enough to everyday experience that the evaluation rubric is interpretable.
Headline finding
All seven frontier models lose accuracy precipitously as clip length grows. Gemini 2.5 Pro starts at 74.8% on 30-second clips and falls to 31.2% on 5-minute clips — a 43.6-point drop. The relationship is approximately log-linear in duration (R² = 0.97). Stronger models start higher but degrade more in absolute terms.
The benchmark
VidWork-Bench draws procedural video segments of 30 seconds to 5 minutes from sources including YouCook2, COIN, HowTo100M, and Ego4D. Each clip contains a complete sub-procedure with at least three identifiable steps, and is paired with structured questions across five evaluation axes. We evaluate seven frontier proprietary multimodal LLMs, each receiving 8–16 keyframes plus a timestamped ASR transcript:
- Step recognition. F1 on the named procedure step depicted in the clip.
- Temporal ordering. Accuracy on questions that require the model to order events.
- Causal reasoning. Accuracy on questions that ask why a step was necessary.
- Error detection. F1 on the detection of intentionally introduced procedure errors.
- Cross-modal grounding. Accuracy on questions that tie visual events to spoken narration.
All scores are reported with 95% bootstrap confidence intervals. We test all seven models (GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Claude Sonnet 4.5, and Claude Haiku 4.5) under matched conditions.
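The 95% bootstrap confidence intervals can be computed with a plain percentile bootstrap. The sketch below uses only the standard library and toy per-clip 0/1 correctness scores; the function name and data are illustrative, not the benchmark's actual code:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Toy per-clip correctness scores (illustrative, not benchmark data).
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, (lo, hi) = bootstrap_ci(scores)
```

With per-clip binary scores, the percentile method is the simplest defensible choice; BCa or studentized intervals would be refinements, not corrections.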
The length degradation curve
The most striking pattern in our data is the relationship between clip length and accuracy. Figure 1 shows accuracy as a function of the logarithm of clip length, with a linear fit for each model.
[Figure 1: accuracy versus clip length (log scale), with a linear fit per model]
The degradation is universal and steep. Across the five models shown in the figure (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5), accuracy drops by 38–44 percentage points from 30 seconds to 5 minutes. Critically, the decline is not uniform across axes: temporal ordering accuracy degrades fastest (a 55.1-point drop for Gemini 2.5 Pro from 30 seconds to 5 minutes) while step recognition degrades slowest (32.4 points). Models can still identify individual actions in long videos but progressively lose the ability to track their relative temporal positions.
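The log-linear fit behind the figure amounts to ordinary least squares on accuracy versus log duration. A minimal sketch: only the 30-second and 5-minute Gemini 2.5 Pro endpoints come from the text, and the intermediate points are made up to illustrate the method:

```python
import math

# (duration_seconds, accuracy) pairs shaped like the reported curve.
# Only the 30 s and 300 s endpoints are from the text; the rest are illustrative.
points = [(30, 0.748), (60, 0.63), (120, 0.51), (180, 0.43), (300, 0.312)]

# Ordinary least squares for accuracy ~ a + b * ln(duration).
xs = [math.log(d) for d, _ in points]
ys = [acc for _, acc in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
a = ybar - b * xbar

# Coefficient of determination for the fit.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - ybar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
```

A negative slope `b` with high `r_squared` is exactly the "approximately log-linear" pattern the benchmark reports.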
If your deployment is a chef reviewing 30-second cooking clips, the ranking at short durations is what matters. If it is a factory floor monitored over twelve-hour shifts, what matters is how slowly a model degrades with clip length, and none of the seven is close to flat.
Temporal ordering is moderate; causal reasoning is the ceiling
Temporal ordering accuracy is moderate across the frontier — 57.8% to 68.2% — with Gemini 2.5 Pro in the lead. Causal reasoning, by contrast, is the ceiling: no model exceeds 63%. Temporal ordering is also the axis that degrades fastest with clip length, losing 55 points for Gemini 2.5 Pro between 30-second and 5-minute clips.
| Model | Temporal (Acc) | Causal (Acc) |
|---|---|---|
| Gemini 2.5 Pro | 68.2% | 62.7% |
| Claude Opus 4.5 | 67.8% | 62.1% |
| Claude Sonnet 4.5 | 67.1% | 61.9% |
| GPT-4o | 65.7% | 60.3% |
| Claude Haiku 4.5 | 64.6% | 58.7% |
| Gemini 2.5 Flash | 64.3% | 58.1% |
| GPT-4o-mini | 57.8% | 51.2% |
Error detection is catastrophically poor
Error detection is the most alarming result in this paper. The best model (GPT-4o) achieves only 28.6% F1, with 38.2% precision and 22.7% recall. Even the best model therefore misses more than 77% of intentional procedural errors, and the frontier as a whole flags 19.3% of correct procedures as containing errors. For AI-assisted quality control or safety monitoring, this is below what any reasonable process tolerance would accept.
| Model | Error Det. F1 |
|---|---|
| GPT-4o | 28.6% |
| Gemini 2.5 Pro | 26.8% |
| Gemini 2.5 Flash | 25.7% |
| Claude Opus 4.5 | 25.1% |
| Claude Sonnet 4.5 | 24.3% |
| Claude Haiku 4.5 | 22.4% |
| GPT-4o-mini | 20.6% |
The spread between the best and worst model on error detection is only 8 points, which is small by the standards of this benchmark. This is not a tiering problem. Every frontier model fails at procedural error detection. Any deployment that asks a video model to catch procedure violations should treat it as a screening layer that must be paired with human review, not an autonomous safety check.
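As a sanity check on the headline number: F1 is the harmonic mean of precision and recall, so GPT-4o's reported 38.2% precision and 22.7% recall pin down its F1 up to rounding. A short sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# GPT-4o's reported error-detection operating point.
best_f1 = f1(0.382, 0.227)  # ~0.285, matching the reported 28.6% up to rounding
```

The harmonic mean is dominated by the smaller of the two inputs, which is why the low 22.7% recall drags F1 well below the 38.2% precision.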
Composite scores
Composite scores are an imperfect summary of a multi-axis benchmark, but they are useful as a single-number comparison. We compute the composite as the unweighted mean across the five axes, so that no axis dominates the ranking.
| Model | Step | Temporal | Causal | Error | X-Modal | Avg. |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 71.4 | 68.2 | 62.7 | 26.8 | 64.1 | 58.6 |
| Claude Opus 4.5 | 67.5 | 67.8 | 62.1 | 25.1 | 63.7 | 57.2 |
| GPT-4o | 68.9 | 65.7 | 60.3 | 28.6 | 61.8 | 57.1 |
| Claude Sonnet 4.5 | 66.3 | 67.1 | 61.9 | 24.3 | 63.4 | 56.6 |
| Gemini 2.5 Flash | 67.8 | 64.3 | 58.1 | 25.7 | 60.2 | 55.2 |
| Claude Haiku 4.5 | 63.2 | 64.6 | 58.7 | 22.4 | 60.5 | 53.9 |
| GPT-4o-mini | 59.6 | 57.8 | 51.2 | 20.6 | 54.7 | 48.8 |
The composite ranking places Gemini 2.5 Pro first with an average of 58.6%. Claude Opus 4.5 and GPT-4o are effectively tied for second (57.1–57.2%). GPT-4o-mini trails the field by nearly 10 points. This ranking does not tell the full story — GPT-4o leads specifically on error detection, while Gemini 2.5 Pro leads on step recognition, temporal ordering, causal reasoning, and cross-modal grounding. VidWork-Bench is designed to expose these axis-level tradeoffs rather than flatten them into a single number.
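The composite column is reproducible directly from the per-axis scores in the table. A small sketch for the top three rows (the dict literal is just the table's data restated):

```python
# Per-axis scores from the composite table:
# (Step, Temporal, Causal, Error, X-Modal), in percentage points.
axes = {
    "Gemini 2.5 Pro":  [71.4, 68.2, 62.7, 26.8, 64.1],
    "Claude Opus 4.5": [67.5, 67.8, 62.1, 25.1, 63.7],
    "GPT-4o":          [68.9, 65.7, 60.3, 28.6, 61.8],
}

# Unweighted mean across the five axes, as described in the text.
composites = {model: round(sum(s) / len(s), 1) for model, s in axes.items()}
```

Because the mean is unweighted, the catastrophic error-detection axis pulls every composite down by roughly seven points relative to a four-axis average, which is part of why the single-number ranking compresses the field.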