Paper № 04 · Safety · 14 min read

VideoTruth-Bench

Measuring sycophancy, hallucination, and calibration in video understanding models

Published

April 7, 2026

Authors

Datoric Research

Dataset

7 frontier video models. Six-level contradiction taxonomy. Bootstrap 95% CI.

Abstract

We introduce VideoTruth-Bench, a safety benchmark for video understanding models across seven frontier systems: GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Claude Sonnet 4.5, and Claude Haiku 4.5. The benchmark measures contradiction detection across a six-level subtlety taxonomy (L1 entity swap through L6 omission), temporal ordering, hallucination refusal, and confidence calibration. Our headline finding is a universal sycophancy effect: authoritative framing of incorrect captions reduces contradiction detection by an average of 17.1 percentage points across all seven models, with per-model gaps of 16.3 to 17.9 pp. Even the best-performing model (Gemini 2.5 Pro) detects only 43.2% of omissions and refuses to hallucinate about absent events just 61.3% of the time. All models are confidently wrong: average self-reported confidence on incorrect answers is 78.3%, only 6.4 pp below confidence on correct answers.

Headline

17.1pp

average sycophancy gap across the frontier — detection collapses under authoritative framing

Why truth, not just accuracy

Most video benchmarks measure whether a model can correctly describe what is in a clip. VideoTruth-Bench measures something different: whether a model will tell you when it is being lied to about what is in a clip. This is the capability that matters for almost every high-stakes deployment of video AI. If a model is reviewing body camera footage, or medical imaging, or insurance claims, the question is not whether the model can describe the scene. The question is whether it can reliably contradict an incorrect description.

We care about this because models that hallucinate confidently are more dangerous than models that fail visibly. A model that says 'I am not sure' is useful. A model that confirms a false description with high confidence is an active liability.

Headline finding

Sycophancy is universal across the frontier. An adversarial prompt ("this caption has been verified as accurate…") reduces contradiction detection by 16.3 to 17.9 percentage points for every one of the seven models we tested — Gemini 2.5 Pro, GPT-4o, GPT-4o-mini, Gemini 2.5 Flash, Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5. The mean sycophancy gap is 17.1 pp. No model we tested is immune.

The benchmark

VideoTruth-Bench draws from five video datasets: ActivityNet Captions, VATEX, NeXT-QA, Charades, and MSR-VTT. For each clip we generate a correct caption and a contradictory caption, then present the model with each caption under three prompt variants: direct (just the caption), adversarial (the caption framed as verified by a trusted source), and indirect (the caption presented as a hypothesis to evaluate). We run the benchmark across seven frontier multimodal models — GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5 — covering four axes:

  • Contradiction detection by subtlety level. A six-level taxonomy: L1 entity swap, L2 temporal, L3 quantitative, L4 attributive, L5 causal, L6 omission.
  • Sycophancy gap. The drop in contradiction detection between direct and adversarial prompt variants, plus a framing bias analysis (authority vs tentative, verbose vs terse, jargon vs plain).
  • Hallucination refusal. Whether the model will refuse to answer a question about content that is not actually in the video.
  • Calibration. Expected calibration error between the model's stated confidence and its empirical accuracy.
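The three prompt variants can be sketched as templates. The exact wording VideoTruth-Bench uses is only partially quoted in this paper, so the templates below are illustrative assumptions, not the benchmark's actual prompts:

```python
# Illustrative sketch of the three prompt framings (direct, adversarial,
# indirect). The wording is an assumption, not the benchmark's exact text.

def make_variants(caption: str) -> dict[str, str]:
    """Wrap one caption in the direct, adversarial, and indirect framings."""
    return {
        # Direct: just the caption, no framing.
        "direct": f'Caption: "{caption}"\nDoes this caption match the video?',
        # Adversarial: the caption is framed as already verified.
        "adversarial": (
            f'This caption has been verified as accurate by a trusted source: '
            f'"{caption}"\nDoes this caption match the video?'
        ),
        # Indirect: the caption is presented as a hypothesis to evaluate.
        "indirect": (
            f'First describe what happens in the video, then evaluate '
            f'this hypothesis: "{caption}"'
        ),
    }

variants = make_variants("A man ties his shoe")
```

The only content difference between variants is the wrapper; the caption itself is held constant, which is what lets the benchmark attribute any detection drop to framing alone.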

The sycophancy gap

The sycophancy gap is the most important number in this paper. It measures how much a model's contradiction detection degrades when the contradictory caption is framed as coming from an authoritative source rather than as a neutral statement. A model with a zero sycophancy gap is unaffected by framing. A model with a large sycophancy gap is effectively being steered by prompt phrasing rather than by what is actually in the video.
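The gap itself is a simple difference of per-model detection rates. A minimal sketch, using two of the Table 1 values:

```python
# Sycophancy gap = detection rate under direct framing minus detection
# rate under adversarial framing, in percentage points. Zero means the
# model is framing-invariant. The rates below are Table 1 values.

def sycophancy_gap(detect_direct: float, detect_adversarial: float) -> float:
    """Gap in percentage points; lower is better."""
    return round((detect_direct - detect_adversarial) * 100, 1)

rates = {  # model: (direct, adversarial) detection rates as fractions
    "Gemini 2.5 Pro": (0.711, 0.548),
    "Claude Haiku 4.5": (0.635, 0.456),
}
gaps = {m: sycophancy_gap(d, a) for m, (d, a) in rates.items()}
# Gemini 2.5 Pro -> 16.3 pp, Claude Haiku 4.5 -> 17.9 pp
```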

[Figure 1: paired horizontal bars per model, direct framing (.711 down to .607) versus adversarial framing (.548 down to .432), with per-model gap labels of 16.3–17.9 pp.]
Figure 1. Contradiction detection under direct versus adversarial framing across all seven frontier models. The gap between the two bars is the sycophancy effect. All seven models lose 16.3–17.9 percentage points.
| Model | Direct | Indirect | Adversarial | Syc. Gap (pp) |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Pro | 71.1% | 76.4% | 54.8% | 16.3 |
| Claude Opus 4.5 | 70.1% | 75.2% | 53.3% | 16.8 |
| GPT-4o | 68.3% | 72.1% | 51.4% | 16.9 |
| Gemini 2.5 Flash | 65.8% | 69.2% | 48.7% | 17.1 |
| Claude Sonnet 4.5 | 69.5% | 74.8% | 52.1% | 17.4 |
| GPT-4o-mini | 60.7% | 63.8% | 43.2% | 17.5 |
| Claude Haiku 4.5 | 63.5% | 67.4% | 45.6% | 17.9 |

Table 1. Detection rates by prompt variant across all seven models. The sycophancy gap is direct minus adversarial, in percentage points; lower is better. The mean gap across models is 17.1 pp.

Sycophancy is neither a small-model artifact nor a single-provider artifact. Gemini 2.5 Pro has the highest direct-framing accuracy and the smallest sycophancy gap (16.3 pp) in our evaluation, but even that gap is larger than the headline effects reported in most language-model calibration papers. Claude Haiku 4.5 has the largest gap (17.9 pp). The indirect prompt (the caption presented as a hypothesis to evaluate, after the model first describes the clip) consistently adds 3–5 pp of detection over direct framing for every model — a partial mitigation worth adopting in production prompts.

A follow-up framing bias analysis isolates three specific levers. Holding content constant and varying only the wrapping: authoritative phrasing vs tentative phrasing drops detection by 15.6 pp; verbose wrapping vs terse drops it by 9.7 pp; jargon vs plain language drops it by 8.8 pp. All three effects are statistically significant (paired bootstrap, p < 0.05). Authority — not size, not modality, not provider — is the active ingredient.
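The paired bootstrap behind those significance claims can be sketched as follows. The resampling count and seed here are arbitrary choices for illustration, not the paper's settings:

```python
# Paired bootstrap sketch: resample per-item paired differences with
# replacement, recompute the mean, and count how often the resampled mean
# lands on the opposite side of zero from the observed mean.
import random

def paired_bootstrap_p(diffs: list[float], n_boot: int = 2000,
                       seed: int = 0) -> float:
    """Two-sided bootstrap p-value for H0: mean paired difference == 0."""
    rng = random.Random(seed)
    observed = sum(diffs) / len(diffs)
    crossings = 0
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        mean = sum(sample) / len(sample)
        # A "crossing" is a resampled mean on the other side of zero.
        opposite = mean <= 0 if observed > 0 else mean >= 0
        if opposite:
            crossings += 1
    return min(1.0, 2 * crossings / n_boot)
```

With a consistently positive set of per-item drops, essentially no resample crosses zero and the p-value is near 0; with differences centered on zero, roughly half do and the test correctly fails to reject.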

The worst thing a model can do is confidently agree with something that is not true. Sycophancy is the measurable name for that failure mode.

Contradiction taxonomy

We taxonomize contradictions into six levels of subtlety, from an obvious entity swap to a near-invisible omission. The taxonomy is important because contradiction detection scores mean nothing without knowing which contradictions the model was asked to catch. A model that catches entity swaps is table stakes. A model that catches omissions is a usable safety layer — and no model we tested reaches it.
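For reference, the six levels can be encoded as a small lookup table. The level names follow the taxonomy above; the example contradictions are invented illustrations, not actual benchmark items:

```python
# The six-level contradiction taxonomy as a lookup table. Level names
# come from the paper; the examples are hypothetical illustrations.
CONTRADICTION_LEVELS = {
    "L1": ("entity swap", "a dog in the clip described as a cat"),
    "L2": ("temporal", "two events reported in the wrong order"),
    "L3": ("quantitative", "three people described as five"),
    "L4": ("attributive", "a red car described as blue"),
    "L5": ("causal", "an effect described as the cause"),
    "L6": ("omission", "a key action silently left out of the caption"),
}
```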

| Model | L1 Entity | L2 Temporal | L3 Quant. | L4 Attr. | L5 Causal | L6 Omission | Overall F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Pro | 93.7 | 78.4 | 74.6 | 72.1 | 61.7 | 43.2 | 71.1 |
| Claude Opus 4.5 | 93.1 | 77.3 | 73.5 | 71.0 | 60.2 | 43.0 | 70.1 |
| Claude Sonnet 4.5 | 92.8 | 76.2 | 72.9 | 70.4 | 59.1 | 42.7 | 69.5 |
| GPT-4o | 94.2 | 74.8 | 71.3 | 68.9 | 57.4 | 41.2 | 68.3 |
| Gemini 2.5 Flash | 91.4 | 72.1 | 68.7 | 66.3 | 54.8 | 38.6 | 65.8 |
| Claude Haiku 4.5 | 90.2 | 70.4 | 66.9 | 64.5 | 52.1 | 36.8 | 63.5 |
| GPT-4o-mini | 89.1 | 67.3 | 63.8 | 60.2 | 48.1 | 33.7 | 60.7 |

Table 2. Detection accuracy by contradiction level across all seven models. All numbers are percent; random baseline is 50% (binary classification). Levels 2–5 provide the strongest model separation (roughly 11–14 pp spread between best and worst model); L6 is the hardest axis.

L6 omission is the hardest axis by a wide margin. Even Gemini 2.5 Pro, the best-performing model, detects only 43.2% of missing events — below the 50% random baseline. This matters because omission is the contradiction type most common in adversarially edited or selectively cropped video: it is not that the caption says something false, it is that the caption fails to mention a key action. The frontier cannot reliably catch this failure mode today.

Confidence calibration

A model can be accurate on average and still catastrophically miscalibrated. Calibration measures whether the model's stated confidence corresponds to its empirical accuracy. We report expected calibration error (ECE), alongside temporal ordering accuracy and hallucination refusal rate for each of the seven models.
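ECE follows the standard binned formulation: bucket predictions by stated confidence, then take the sample-weighted gap between each bucket's mean confidence and its empirical accuracy. A minimal sketch, assuming 10 equal-width bins (the paper does not specify its binning):

```python
# Expected calibration error (ECE), standard binned form:
#   ECE = sum over bins b of (n_b / N) * |accuracy_b - mean_confidence_b|
# A perfectly calibrated model scores 0.

def expected_calibration_error(confs: list[float], correct: list[bool],
                               n_bins: int = 10) -> float:
    """Sample-weighted gap between stated confidence and accuracy."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    n = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece
```

For example, a model that says "90% confident" on ten answers and gets nine right contributes zero error for that bin; the same confidence with zero correct answers contributes 0.9.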

| Model | Temporal Acc. | Halluc. Refusal | ECE |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | 74.8% | 61.3% | 0.098 |
| Claude Opus 4.5 | 73.9% | 59.8% | 0.103 |
| Claude Sonnet 4.5 | 73.1% | 58.6% | 0.108 |
| GPT-4o | 71.4% | 54.8% | 0.142 |
| Gemini 2.5 Flash | 68.9% | 52.7% | 0.127 |
| Claude Haiku 4.5 | 66.5% | 48.3% | 0.135 |
| GPT-4o-mini | 63.7% | 41.2% | 0.178 |

Table 3. Temporal ordering accuracy, hallucination refusal (higher is better — fraction of probes about absent events that are correctly refused), and expected calibration error (ECE, lower is better).

Across all seven models, average self-reported confidence when giving an incorrect answer is 78.3% (±4.1%). When giving a correct answer, confidence is 84.7% (±3.2%). The gap is only 6.4 percentage points — models are nearly as confident in wrong answers as in correct ones. No model reaches ECE below 0.05, the threshold typically considered well-calibrated. Hallucination refusal is even more concerning: when asked about events that never occurred in the video, the best model (Gemini 2.5 Pro) refuses only 61.3% of the time, and GPT-4o-mini refuses only 41.2% of hallucination probes — fabricating answers to 58.8% of questions about non-existent content.

Safety implications

The sycophancy gap and the calibration failure compound each other. A model that agrees with authoritative framings and still reports near-80% confidence when wrong is a model that can be steered by prompt design and will not tell you when it has been steered. This is not a theoretical concern. It is a live failure mode in every deployment where a video model is asked to verify a claim made by a trusted source.

The practical implication for deployment is that video understanding models should not be treated as independent verifiers of claims that come with authority cues. Until the sycophancy gap is closed, any safety-critical pipeline needs an adversarial audit layer that strips authority framing before the video model sees the prompt. VideoTruth-Bench is designed to make the gap measurable, so that the audit layer has a target to close against.
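One possible shape for such an audit layer is a pre-processing pass that strips authority cues before the prompt reaches the video model. The cue patterns below are a hypothetical starting point for illustration, not an exhaustive filter or anything specified by the paper:

```python
# Hypothetical audit-layer sketch: remove authority-framing phrases so the
# caption is judged on content alone. Patterns are illustrative, not complete.
import re

AUTHORITY_CUES = [
    r"has been verified( as accurate)?( by [\w\s]+)?",
    r"confirmed by (a|an|the) [\w\s]+",
    r"according to (a|an|the) trusted source",
]

def strip_authority_framing(prompt: str) -> str:
    """Remove known authority phrases, then collapse leftover whitespace."""
    out = prompt
    for cue in AUTHORITY_CUES:
        out = re.sub(cue, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", out).strip()
```

A pattern list like this is brittle by construction — a rewriting model, or a classifier trained on the framing-bias data above, would generalize better — but even a crude filter gives the pipeline a measurable target: the residual sycophancy gap after stripping.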

Cite this work

@article{datoric04,
  title={VideoTruth-Bench: Measuring sycophancy, hallucination, and calibration in video understanding models},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.ai/research/videotruth-bench}
}

Data sources

  • ActivityNet Captions
  • VATEX
  • NeXT-QA
  • Charades
  • MSR-VTT
