Paper № 03 · Video · 13 min read

VidWork-Bench

Temporal reasoning and professional workflow understanding in video models

Published

March 24, 2026

Authors

Datoric Research

Dataset

Procedural video segments (30s–5min). 7 frontier multimodal LLMs. 5 axes with bootstrap 95% CI.

Abstract

We evaluate seven frontier video models on procedural workflow understanding — GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Claude Sonnet 4.5, and Claude Haiku 4.5 — across five axes: step recognition, temporal ordering, causal reasoning, error detection, and cross-modal grounding. Gemini 2.5 Pro leads with an average of 58.6% across axes; GPT-4o-mini trails at 48.8%. Error detection is catastrophically poor across the entire frontier: the best model (GPT-4o) reaches only 28.6% F1, missing 77% of intentional procedural errors while flagging 19.3% of correct procedures as faulty. The headline finding is a sharp clip-length degradation curve. Gemini 2.5 Pro drops from 74.8% accuracy on 30-second clips to 31.2% on 5-minute clips, a 43.6 percentage-point decline that follows an approximately log-linear relationship with duration.

Headline

−43.6pp

accuracy decline for the best model (Gemini 2.5 Pro) from 30s to 5min clips

Why workflows matter

The promising use cases for video AI are professional: surgical training review, factory line monitoring, equipment repair guidance, and instructional content synthesis. These tasks share a structural property that is missing from public video benchmarks. They are procedural. They require the model to understand not just what objects are present, but what order steps occur in, which steps depend on which, and whether the sequence was executed correctly.

VidWork-Bench is an attempt to measure procedural video understanding directly. We chose cooking as the pilot domain because it is procedurally rich, well annotated through the YouCook2 dataset, and close enough to everyday experience that the evaluation rubric is interpretable.

Headline finding

All seven frontier models lose accuracy precipitously as clip length grows. Gemini 2.5 Pro starts at 74.8% on 30-second clips and falls to 31.2% on 5-minute clips — a 43.6-point drop. The relationship is approximately log-linear in duration (R² = 0.97). Stronger models start higher but degrade more in absolute terms.

The benchmark

VidWork-Bench draws procedural video segments of 30 seconds to 5 minutes from sources including YouCook2, COIN, HowTo100M, and Ego4D. Each clip contains a complete sub-procedure with at least three identifiable steps, and is paired with structured questions across five evaluation axes. We evaluate seven frontier proprietary multimodal LLMs, each receiving 8–16 keyframes plus a timestamped ASR transcript:

  1. Step recognition. F1 on the named procedure step depicted in the clip.
  2. Temporal ordering. Accuracy on questions that require the model to order events.
  3. Causal reasoning. Accuracy on questions that ask why a step was necessary.
  4. Error detection. F1 on the detection of intentionally introduced procedure errors.
  5. Cross-modal grounding. Accuracy on questions that tie visual events to spoken narration.
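Concretely, each benchmark item bundles a clip reference, the sampled keyframes, the timestamped ASR transcript, and one axis-tagged question. A minimal sketch of such a record — the field names and example values are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record for one VidWork-Bench item. Field names and the
# example values below are assumptions for illustration, not the
# benchmark's real data format.
@dataclass
class BenchItem:
    clip_id: str
    source: str               # e.g. "YouCook2", "COIN", "HowTo100M", "Ego4D"
    duration_s: float         # 30–300 seconds per the benchmark spec
    keyframes: list           # 8–16 sampled frame paths given to the model
    asr_transcript: list      # [(start_s, end_s, text), ...] timestamped ASR
    axis: str                 # one of the five evaluation axes
    question: str
    answer: str

item = BenchItem(
    clip_id="yc2_0413_seg2",          # hypothetical ID
    source="YouCook2",
    duration_s=120.0,
    keyframes=["f_000.jpg", "f_010.jpg"],
    asr_transcript=[(3.2, 6.8, "now whisk the eggs")],
    axis="temporal_ordering",
    question="Which happens first: whisking or folding?",
    answer="whisking",
)
```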

All scores are reported with 95% bootstrap confidence intervals. We test all seven models — GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5 — under matched conditions.
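A percentile bootstrap over per-item scores is one standard way to produce such intervals; a minimal sketch (the resample count and the toy data are illustrative, not the paper's exact procedure):

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times and record each mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-item correctness (1 = correct): 59% accuracy on 100 items.
scores = [1] * 59 + [0] * 41
lo, hi = bootstrap_ci(scores)
```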

The length degradation curve

The most striking pattern in our data is the relationship between clip length and accuracy. Figure 1 shows accuracy as a function of the logarithm of clip length, with a linear fit for each model.

Figure

[Figure 1: line chart of accuracy (0.0–0.8) vs. clip length (log seconds, 30s–300s), one line per model: Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.5, Claude Opus 4.5, Claude Haiku 4.5.]
Figure 1. Average accuracy across all five axes as a function of clip duration. Stronger models start higher but degrade more in absolute terms. The relationship is approximately log-linear in duration (R² = 0.97 for Gemini 2.5 Pro).

The degradation is universal and steep. Across all five tested models on the curve (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.5, Claude Opus 4.5, Claude Haiku 4.5), accuracy drops by 38–44 percentage points from 30 seconds to 5 minutes. Critically, the decline is not uniform across axes: temporal ordering accuracy degrades fastest (a 55.1-point drop for Gemini 2.5 Pro over 30s→5min) while step recognition degrades slowest (32.4 points). Models can still identify individual actions in long videos but progressively lose the ability to track their relative temporal positions.
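The two reported endpoints already pin down the slope of the paper's log-linear model for Gemini 2.5 Pro; a quick back-of-envelope:

```python
import math

# Reported endpoints for Gemini 2.5 Pro (numbers from the paper).
acc_30s, acc_300s = 74.8, 31.2
dur_30s, dur_300s = 30.0, 300.0

# Under the log-linear model acc = a + b * ln(duration), the two
# endpoints determine the slope b.
b = (acc_300s - acc_30s) / (math.log(dur_300s) - math.log(dur_30s))
per_doubling = b * math.log(2)  # accuracy change each time length doubles

print(f"slope: {b:.1f} pp per ln-unit, {per_doubling:.1f} pp per doubling")
# → slope: -18.9 pp per ln-unit, -13.1 pp per doubling
```

Roughly 13 points of accuracy are lost every time clip length doubles, if the log-linear fit holds across the range.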

If your deployment is a chef reviewing 30-second cooking clips, the top of the leaderboard is a reasonable guide. If it is a factory floor monitored over twelve-hour shifts, no model on this curve is close to length-invariant, and the degradation rate matters more than the starting accuracy.

Temporal ordering holds up; causal reasoning is the ceiling

Temporal ordering accuracy is moderate across the frontier — 57.8% to 68.2% — with Gemini 2.5 Pro in the lead. Causal reasoning sits lower: no model exceeds 63%, making it the effective ceiling on composite scores. Note the tension with the degradation curve: temporal ordering holds up on short clips but is also the axis that degrades fastest with length, losing 55.1 points for Gemini 2.5 Pro between 30-second and 5-minute clips.

Model              Temporal (Acc)   Causal (Acc)
Gemini 2.5 Pro     68.2%            62.7%
Claude Opus 4.5    67.8%            62.1%
Claude Sonnet 4.5  67.1%            61.9%
GPT-4o             65.7%            60.3%
Claude Haiku 4.5   64.6%            58.7%
Gemini 2.5 Flash   64.3%            58.1%
GPT-4o-mini        57.8%            51.2%

Table 1. Temporal ordering and causal reasoning accuracy across all seven models.

Error detection is catastrophically poor

Error detection is the most alarming result in this paper. The best model (GPT-4o) achieves only 28.6% F1, with 38.2% precision and 22.7% recall. Across the frontier, models miss 77% of intentional procedural errors while simultaneously flagging 19.3% of correct procedures as containing errors. For AI-assisted quality control or safety monitoring, this is below what any reasonable process tolerance would accept.
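The reported F1 is consistent with the precision and recall: the harmonic mean of 38.2% and 22.7% lands at 28.5%, matching the reported 28.6% up to rounding of the published figures. A one-line check:

```python
# GPT-4o's reported precision and recall on error detection.
precision, recall = 0.382, 0.227

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # → F1 = 28.5%
```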

Model              Error Det. F1
GPT-4o             28.6%
Gemini 2.5 Pro     26.8%
Gemini 2.5 Flash   25.7%
Claude Opus 4.5    25.1%
Claude Sonnet 4.5  24.3%
Claude Haiku 4.5   22.4%
GPT-4o-mini        20.6%

Table 2. Error detection F1 across all seven models (higher is better). The best model catches barely one in four intentional errors.

The spread between the best and worst model on error detection is only 8 points, which is small by the standards of this benchmark. This is not a tiering problem. Every frontier model fails at procedural error detection. Any deployment that asks a video model to catch procedure violations should treat it as a screening layer that must be paired with human review, not an autonomous safety check.
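GPT-4o's published precision and recall translate directly into reviewer workload for such a screening layer. A back-of-envelope using only those two reported numbers (the 1,000-error scenario is hypothetical):

```python
# GPT-4o's reported error-detection precision and recall.
precision, recall = 0.382, 0.227

true_errors = 1000                # hypothetical number of real errors
caught = recall * true_errors     # real errors the model flags
flags = caught / precision        # total flags reviewers must triage
missed = true_errors - caught     # real errors that slip through

print(f"caught {caught:.0f}, missed {missed:.0f}, "
      f"reviewers triage {flags:.0f} flags")
# → caught 227, missed 773, reviewers triage 594 flags
```

Under these numbers, surfacing 227 real errors costs reviewers roughly 594 flag reviews, while 773 errors pass unflagged — hence screening layer, not safety check.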

Composite scores

Composite scores are an imperfect summary of a multi-axis benchmark, but they are useful as a single-number comparison. We compute the composite as the unweighted mean across the five axes, so that no axis dominates the ranking.

Model              Step   Temporal  Causal  Error  X-Modal  Avg.
Gemini 2.5 Pro     71.4   68.2      62.7    26.8   64.1     58.6
Claude Opus 4.5    67.5   67.8      62.1    25.1   63.7     57.2
GPT-4o             68.9   65.7      60.3    28.6   61.8     57.1
Claude Sonnet 4.5  66.3   67.1      61.9    24.3   63.4     56.6
Gemini 2.5 Flash   67.8   64.3      58.1    25.7   60.2     55.2
Claude Haiku 4.5   63.2   64.6      58.7    22.4   60.5     53.9
GPT-4o-mini        59.6   57.8      51.2    20.6   54.7     48.8

Table 3. Full results across five VidWork-Bench axes. All numbers are accuracy or F1 in percent (higher is better).

The composite ranking places Gemini 2.5 Pro first with an average of 58.6%. Claude Opus 4.5 and GPT-4o are effectively tied for second (57.1–57.2%). GPT-4o-mini trails the field by nearly 10 points. This ranking does not tell the full story — GPT-4o leads specifically on error detection, while Gemini 2.5 Pro leads on step recognition, temporal ordering, causal reasoning, and cross-modal grounding. VidWork-Bench is designed to expose these axis-level tradeoffs rather than flatten them into a single number.
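The Avg. column is reproducible from Table 3's per-axis scores; a quick check on a few rows (numbers copied from the table):

```python
# Per-axis scores from Table 3: Step, Temporal, Causal, Error, X-Modal.
table3 = {
    "Gemini 2.5 Pro":  [71.4, 68.2, 62.7, 26.8, 64.1],
    "Claude Opus 4.5": [67.5, 67.8, 62.1, 25.1, 63.7],
    "GPT-4o":          [68.9, 65.7, 60.3, 28.6, 61.8],
    "GPT-4o-mini":     [59.6, 57.8, 51.2, 20.6, 54.7],
}

# The composite is the unweighted mean across the five axes.
composites = {m: round(sum(v) / len(v), 1) for m, v in table3.items()}
print(composites)
# → {'Gemini 2.5 Pro': 58.6, 'Claude Opus 4.5': 57.2,
#    'GPT-4o': 57.1, 'GPT-4o-mini': 48.8}
```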

Cite this work

@article{datoric03,
  title={VidWork-Bench: Temporal reasoning and professional workflow understanding in video models},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.ai/research/vidwork-bench}
}

Data sources

  • YouCook2
  • COIN
  • HowTo100M
  • Ego4D
