Why truth, not just accuracy
Most video benchmarks measure whether a model can correctly describe what is in a clip. VideoTruth-Bench measures something different: whether a model will tell you when it is being lied to about what is in a clip. This is the capability that matters for almost every high-stakes deployment of video AI. If a model is reviewing body camera footage, or medical imaging, or insurance claims, the question is not whether the model can describe the scene. The question is whether it can reliably contradict an incorrect description.
We care about this because models that hallucinate confidently are more dangerous than models that fail visibly. A model that says 'I am not sure' is useful. A model that confirms a false description with high confidence is an active liability.
Headline finding
Sycophancy is universal across the frontier. An adversarial prompt ("this caption has been verified as accurate…") reduces contradiction detection by 16.3 to 17.9 percentage points for every one of the seven models we tested — Gemini 2.5 Pro, GPT-4o, GPT-4o-mini, Gemini 2.5 Flash, Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5. The mean sycophancy gap is 17.1 pp. No model we tested is immune.
The benchmark
VideoTruth-Bench draws from five video datasets: ActivityNet Captions, VATEX, NeXT-QA, Charades, and MSR-VTT. For each clip we generate a correct caption and a contradictory caption, then present the model with each caption under three prompt variants: direct (just the caption), adversarial (the caption framed as verified by a trusted source), and indirect (the caption presented as a hypothesis to evaluate). We run the benchmark across seven frontier multimodal models — GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5 — covering four axes:
- 01Contradiction detection by subtlety level. A six-level taxonomy from L1 entity swap through L2 temporal, L3 quantitative, L4 attributive, L5 causal, and L6 omission.
- 02Sycophancy gap. The drop in contradiction detection between direct and adversarial prompt variants, plus a framing bias analysis (authority vs tentative, verbose vs terse, jargon vs plain).
- 03Hallucination refusal. Whether the model will refuse to answer a question about content that is not actually in the video.
- 04Calibration. Expected calibration error between the model's stated confidence and its empirical accuracy.
The sycophancy gap
The sycophancy gap is the most important number in this paper. It measures how much a model's contradiction detection degrades when the contradictory caption is framed as coming from an authoritative source rather than as a neutral statement. A model with a zero sycophancy gap is unaffected by framing. A model with a large sycophancy gap is effectively being steered by prompt phrasing rather than by what is actually in the video.
Figure
Direct framing
Adversarial framing
Gemini 2.5 Pro
gap 16.3pp
Claude Opus 4.5
gap 16.8pp
Claude Sonnet 4.5
gap 17.4pp
GPT-4o
gap 16.9pp
Gemini 2.5 Flash
gap 17.1pp
Claude Haiku 4.5
gap 17.9pp
GPT-4o-mini
gap 17.5pp
| Model | Direct | Indirect | Adversarial | Syc. Gap |
|---|---|---|---|---|
| Gemini 2.5 Pro | 71.1% | 76.4% | 54.8% | 16.3 |
| Claude Opus 4.5 | 70.1% | 75.2% | 53.3% | 16.8 |
| GPT-4o | 68.3% | 72.1% | 51.4% | 16.9 |
| Gemini 2.5 Flash | 65.8% | 69.2% | 48.7% | 17.1 |
| Claude Sonnet 4.5 | 69.5% | 74.8% | 52.1% | 17.4 |
| GPT-4o-mini | 60.7% | 63.8% | 43.2% | 17.5 |
| Claude Haiku 4.5 | 63.5% | 67.4% | 45.6% | 17.9 |
Sycophancy is not a small-model artifact and it is not a single-provider artifact. Gemini 2.5 Pro has the highest direct-framing accuracy and the smallest sycophancy gap (16.3 pp) in our evaluation, but even that gap is larger than the headline result of most language-model calibration papers. Claude Haiku 4.5 has the largest gap (17.9 pp). The indirect prompt ("describe first, then compare") consistently recovers 1–5 pp of detection across every model — a partial mitigation worth adopting in production prompts.
A follow-up framing bias analysis isolates three specific levers. Holding content constant and varying only the wrapping: confident phrasing vs tentative phrasing drops detection by 15.6 pp; verbose wrapping vs terse drops it by 9.7 pp; jargon vs plain language drops it by 8.8 pp. All three effects are statistically significant (paired bootstrap, p < 0.05). Authority — not size, not modality, not provider — is the active ingredient.
The worst thing a model can do is confidently agree with something that is not true. Sycophancy is the measurable name for that failure mode.
Contradiction taxonomy
We taxonomize contradictions into six levels of subtlety, from an obvious entity swap to a near-invisible omission. The taxonomy is important because contradiction detection scores mean nothing without knowing which contradictions the model was asked to catch. A model that catches entity swaps is table stakes. A model that catches omissions is a usable safety layer — and no model we tested reaches it.
| Model | L1 Entity | L2 Temp. | L3 Quant. | L4 Attr. | L5 Causal | L6 Omission | Overall F1 |
|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 93.7 | 78.4 | 74.6 | 72.1 | 61.7 | 43.2 | 71.1 |
| Claude Opus 4.5 | 93.1 | 77.3 | 73.5 | 71.0 | 60.2 | 43.0 | 70.1 |
| Claude Sonnet 4.5 | 92.8 | 76.2 | 72.9 | 70.4 | 59.1 | 42.7 | 69.5 |
| GPT-4o | 94.2 | 74.8 | 71.3 | 68.9 | 57.4 | 41.2 | 68.3 |
| Gemini 2.5 Flash | 91.4 | 72.1 | 68.7 | 66.3 | 54.8 | 38.6 | 65.8 |
| Claude Haiku 4.5 | 90.2 | 70.4 | 66.9 | 64.5 | 52.1 | 36.8 | 63.5 |
| GPT-4o-mini | 89.1 | 67.3 | 63.8 | 60.2 | 48.1 | 33.7 | 60.7 |
L6 omission is the hardest axis by a wide margin. Even Gemini 2.5 Pro, the best-performing model, detects only 43.2% of missing events — below the 50% random baseline. This matters because omission is the contradiction type most common in adversarially edited or selectively cropped video: it is not that the caption says something false, it is that the caption fails to mention a key action. The frontier cannot reliably catch this failure mode today.
Confidence calibration
A model can be accurate on average and still catastrophically miscalibrated. Calibration measures whether the model's stated confidence corresponds to its empirical accuracy. We report expected calibration error (ECE), alongside temporal ordering accuracy and hallucination refusal rate for each of the seven models.
| Model | Temporal Acc. | Halluc. Refusal | ECE |
|---|---|---|---|
| Gemini 2.5 Pro | 74.8% | 61.3% | 0.098 |
| Claude Opus 4.5 | 73.9% | 59.8% | 0.103 |
| Claude Sonnet 4.5 | 73.1% | 58.6% | 0.108 |
| GPT-4o | 71.4% | 54.8% | 0.142 |
| Gemini 2.5 Flash | 68.9% | 52.7% | 0.127 |
| Claude Haiku 4.5 | 66.5% | 48.3% | 0.135 |
| GPT-4o-mini | 63.7% | 41.2% | 0.178 |
Across all seven models, average self-reported confidence when giving an incorrect answer is 78.3% (±4.1%). When giving a correct answer, confidence is 84.7% (±3.2%). The gap is only 6.4 percentage points — models are nearly as confident in wrong answers as in correct ones. No model reaches ECE below 0.05, the threshold typically considered well-calibrated. Hallucination refusal is even more concerning: when asked about events that never occurred in the video, the best model (Gemini 2.5 Pro) refuses only 61.3% of the time, and GPT-4o-mini refuses only 41.2% of hallucination probes — fabricating answers to 58.8% of questions about non-existent content.
Safety implications
The sycophancy gap and the calibration failure compound each other. A model that agrees with authoritative framings and reports 50% confidence when wrong is a model that can be steered by prompt design and will not tell you when it has been steered. This is not a theoretical concern. It is a live failure mode in every deployment where a video model is asked to verify a claim made by a trusted source.
The practical implication for deployment is that video understanding models should not be treated as independent verifiers of claims that come with authority cues. Until the sycophancy gap is closed, any safety-critical pipeline needs an adversarial audit layer that strips authority framing before the video model sees the prompt. VideoTruth-Bench is designed to make the gap measurable, so that the audit layer has a target to close against.