Why workflows matter
The most promising use cases for video AI are professional: surgical training review, factory line monitoring, equipment repair guidance, and instructional content synthesis. These tasks share a structural property that is missing from public video benchmarks. They are procedural. They require the model to understand not just what objects are present, but what order steps occur in, which steps depend on which, and whether the sequence was executed correctly.
VidWork-Bench is an attempt to measure procedural video understanding directly. We chose cooking as the pilot domain because it is procedurally rich, well annotated through the YouCook2 dataset, and close enough to everyday experience that the evaluation rubric is interpretable.
Headline finding
All seven frontier models lose accuracy precipitously as clip length grows. Gemini 2.5 Pro starts at 74.8% on 30-second clips and falls to 31.2% on 5-minute clips — a 43.6-point drop. The relationship is approximately log-linear in duration (R² = 0.97). Stronger models start higher but degrade more in absolute terms.
The benchmark
VidWork-Bench draws procedural video segments of 30 seconds to 5 minutes from sources including YouCook2, COIN, HowTo100M, and Ego4D. Each clip contains a complete sub-procedure with at least three identifiable steps, and is paired with structured questions across five evaluation axes. We evaluate seven frontier proprietary multimodal LLMs, each receiving 8–16 keyframes plus a timestamped ASR transcript:
- Step recognition. F1 on the named procedure step depicted in the clip.
- Temporal ordering. Accuracy on questions that require the model to order events.
- Causal reasoning. Accuracy on questions that ask why a step was necessary.
- Error detection. F1 on the detection of intentionally introduced procedure errors.
- Cross-modal grounding. Accuracy on questions that tie visual events to spoken narration.
All scores are reported with 95% bootstrap confidence intervals. We test all seven models (GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Claude Sonnet 4.5, and Claude Haiku 4.5) under matched conditions.
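The 95% bootstrap confidence intervals can be computed with a plain percentile bootstrap. The sketch below uses only the standard library and toy per-clip 0/1 correctness scores; the function name and data are illustrative, not the benchmark's actual code:

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Toy per-clip correctness scores (illustrative, not benchmark data).
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
mean, (lo, hi) = bootstrap_ci(scores)
```

With per-clip binary scores, the percentile method is the simplest defensible choice; BCa or studentized intervals would be refinements, not corrections.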
The length degradation curve
The most striking pattern in our data is the relationship between clip length and accuracy. Figure 1 shows accuracy as a function of the logarithm of clip length, with a linear fit for each model.
[Figure 1: accuracy versus clip length (log scale), with a linear fit per model]
The degradation is universal and steep. Across the five models shown in the figure (Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5), accuracy drops by 38–44 percentage points from 30 seconds to 5 minutes. Critically, the decline is not uniform across axes: temporal ordering accuracy degrades fastest (a 55.1-point drop for Gemini 2.5 Pro from 30 seconds to 5 minutes) while step recognition degrades slowest (32.4 points). Models can still identify individual actions in long videos but progressively lose the ability to track their relative temporal positions.
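The log-linear fit behind the figure amounts to ordinary least squares on accuracy versus log duration. A minimal sketch: only the 30-second and 5-minute Gemini 2.5 Pro endpoints come from the text, and the intermediate points are made up to illustrate the method:

```python
import math

# (duration_seconds, accuracy) pairs shaped like the reported curve.
# Only the 30 s and 300 s endpoints are from the text; the rest are illustrative.
points = [(30, 0.748), (60, 0.63), (120, 0.51), (180, 0.43), (300, 0.312)]

# Ordinary least squares for accuracy ~ a + b * ln(duration).
xs = [math.log(d) for d, _ in points]
ys = [acc for _, acc in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
a = ybar - b * xbar

# Coefficient of determination for the fit.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - ybar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
```

A negative slope `b` with high `r_squared` is exactly the "approximately log-linear" pattern the benchmark reports.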
If your deployment is a chef reviewing 30-second cooking clips, the ranking at short durations is what matters. If it is a factory floor monitored over twelve-hour shifts, what matters is how slowly a model degrades with clip length, and none of the seven is close to flat.
Temporal ordering is moderate; causal reasoning is the ceiling
Temporal ordering accuracy is moderate across the frontier — 57.8% to 68.2% — with Gemini 2.5 Pro in the lead. Causal reasoning, by contrast, is the ceiling: no model exceeds 63%. Temporal ordering is also the axis that degrades fastest with clip length, losing 55 points for Gemini 2.5 Pro between 30-second and 5-minute clips.
| Model | Temporal (Acc) | Causal (Acc) |
|---|---|---|
| Gemini 2.5 Pro | 68.2% | 62.7% |
| Claude Opus 4.5 | 67.8% | 62.1% |
| Claude Sonnet 4.5 | 67.1% | 61.9% |
| GPT-4o | 65.7% | 60.3% |
| Claude Haiku 4.5 | 64.6% | 58.7% |
| Gemini 2.5 Flash | 64.3% | 58.1% |
| GPT-4o-mini | 57.8% | 51.2% |
Error detection is catastrophically poor
Error detection is the most alarming result in this paper. The best model (GPT-4o) achieves only 28.6% F1, with 38.2% precision and 22.7% recall. Even the best model therefore misses more than 77% of intentional procedural errors, and the frontier as a whole flags 19.3% of correct procedures as containing errors. For AI-assisted quality control or safety monitoring, this is below what any reasonable process tolerance would accept.
| Model | Error Det. F1 |
|---|---|
| GPT-4o | 28.6% |
| Gemini 2.5 Pro | 26.8% |
| Gemini 2.5 Flash | 25.7% |
| Claude Opus 4.5 | 25.1% |
| Claude Sonnet 4.5 | 24.3% |
| Claude Haiku 4.5 | 22.4% |
| GPT-4o-mini | 20.6% |
The spread between the best and worst model on error detection is only 8 points, which is small by the standards of this benchmark. This is not a tiering problem. Every frontier model fails at procedural error detection. Any deployment that asks a video model to catch procedure violations should treat it as a screening layer that must be paired with human review, not an autonomous safety check.
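As a sanity check on the headline number: F1 is the harmonic mean of precision and recall, so GPT-4o's reported 38.2% precision and 22.7% recall pin down its F1 up to rounding. A short sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# GPT-4o's reported error-detection operating point.
best_f1 = f1(0.382, 0.227)  # ~0.285, matching the reported 28.6% up to rounding
```

The harmonic mean is dominated by the smaller of the two inputs, which is why the low 22.7% recall drags F1 well below the 38.2% precision.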
Composite scores
Composite scores are an imperfect summary of a multi-axis benchmark, but they are useful as a single-number comparison. We compute the composite as the unweighted mean across the five axes, so that no axis dominates the ranking.
| Model | Step | Temporal | Causal | Error | X-Modal | Avg. |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 71.4 | 68.2 | 62.7 | 26.8 | 64.1 | 58.6 |
| Claude Opus 4.5 | 67.5 | 67.8 | 62.1 | 25.1 | 63.7 | 57.2 |
| GPT-4o | 68.9 | 65.7 | 60.3 | 28.6 | 61.8 | 57.1 |
| Claude Sonnet 4.5 | 66.3 | 67.1 | 61.9 | 24.3 | 63.4 | 56.6 |
| Gemini 2.5 Flash | 67.8 | 64.3 | 58.1 | 25.7 | 60.2 | 55.2 |
| Claude Haiku 4.5 | 63.2 | 64.6 | 58.7 | 22.4 | 60.5 | 53.9 |
| GPT-4o-mini | 59.6 | 57.8 | 51.2 | 20.6 | 54.7 | 48.8 |
The composite ranking places Gemini 2.5 Pro first with an average of 58.6%. Claude Opus 4.5 and GPT-4o are effectively tied for second (57.1–57.2%). GPT-4o-mini trails the field by nearly 10 points. This ranking does not tell the full story — GPT-4o leads specifically on error detection, while Gemini 2.5 Pro leads on step recognition, temporal ordering, causal reasoning, and cross-modal grounding. VidWork-Bench is designed to expose these axis-level tradeoffs rather than flatten them into a single number.
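The composite column is reproducible directly from the per-axis scores in the table. A small sketch for the top three rows (the dict literal is just the table's data restated):

```python
# Per-axis scores from the composite table:
# (Step, Temporal, Causal, Error, X-Modal), in percentage points.
axes = {
    "Gemini 2.5 Pro":  [71.4, 68.2, 62.7, 26.8, 64.1],
    "Claude Opus 4.5": [67.5, 67.8, 62.1, 25.1, 63.7],
    "GPT-4o":          [68.9, 65.7, 60.3, 28.6, 61.8],
}

# Unweighted mean across the five axes, as described in the text.
composites = {model: round(sum(s) / len(s), 1) for model, s in axes.items()}
```

Because the mean is unweighted, the catastrophic error-detection axis pulls every composite down by roughly seven points relative to a four-axis average, which is part of why the single-number ranking compresses the field.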