Paper № 01Evaluation11 min read

VoicePro-Bench

Evaluating frontier voice AI models on professional speech understanding

Published

February 24, 2026

Authors

Datoric Research

Dataset

Five axes: 760-sample FLEURS + VoxPopuli transcription (200 evaluated), 200 SLURP intent + entity, 200 MELD emotion, 248 AMI reasoning questions, 200-sample MUSAN/WHAM noise cliff at four SNR levels. 12 models. Bootstrap 95% CI.

Download PDF View PDF

Abstract

VoicePro-Bench is a five-axis benchmark for professional voice AI: transcription, SLURP intent and entity extraction, MELD emotion, AMI multi-turn reasoning, and MUSAN/WHAM noise robustness. We evaluate 12 systems on a 200-sample transcription subset (drawn from a 760-sample pool): five dedicated ASR providers (Whisper v3, Deepgram Nova-3, Nova-2, AssemblyAI Universal-2, ElevenLabs Scribe), four audio-native multimodal LLMs (GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, Gemini 2.5 Flash), and three Claude text-reasoner controls (Opus, Sonnet, Haiku 4.5) over reference transcripts. On transcription, ElevenLabs Scribe leads at WER 0.408 and is essentially matched by Gemini 2.5 Pro (0.411) on point estimate with heavily overlapping CIs — the first instance in our evaluation of an audio-native MLLM reaching statistical parity with a production-grade dedicated ASR at full coverage. AssemblyAI Universal-2 has the lowest CER among ASR providers (0.125). The decisive results are downstream: on SLURP intent the best text-reasoner control beats the best audio-native MLLM by 27 pp F1, and on AMI reasoning the gap is 12.4 pp token-F1, locating audio understanding (not text reasoning) as the dominant source of error. GPT-4o Audio refuses 20% of valid SLURP transcription requests and carries a bimodal transcription distribution (WER 0.612, CI 0.464–0.795) driven by a heavy CJK tail. Gemini 2.5 Flash posts a 20-to-0 dB WER jump of +0.073 under MUSAN/WHAM noise, roughly 1.5× the steepest dedicated-ASR cliff.

Headline

27 pp

SLURP intent F1 gap: best text-reasoner control over the best audio-native MLLM — audio understanding is the dominant source of error

Motivation

Voice models are increasingly deployed in call centers, clinical intake, legal discovery, and enterprise meeting tools. These contexts share properties that public benchmarks rarely capture: noisy environments, accented and non-native speakers, dense domain jargon, and the need to classify intent and extract entities rather than just emit a transcript. Standard speech evaluations do not tell us whether a model can do the downstream work that a deployment actually pays for.

VoicePro-Bench is structured around five independent axes — transcription, SLURP intent and entity extraction, MELD emotion, AMI multi-turn reasoning, and MUSAN/WHAM noise robustness — with three model classes (dedicated ASR, audio-native MLLM, and a Claude text-reasoner control on the reference transcript) scored under matched conditions. The text-reasoner control is the lever that lets us isolate audio understanding from reasoning capacity: when the same model class answers the same questions from raw audio and from the reference transcript, the difference is attributable to audio, not to prompting.

Main result

On SLURP intent the best text-reasoner control (Claude Opus 4.5, F1 0.748) outperforms the best audio-native MLLM (Gemini 2.5 Pro, F1 0.479) by 27 pp. On AMI reasoning Claude Haiku 4.5 (text, F1 0.636) leads the best audio-native MLLM by 12.4 pp token-F1. Same questions, same answers — audio understanding is the dominant source of error, not reasoning capacity.

The benchmark

The transcription axis uses 760 curated audio samples drawn from FLEURS English (en-US), VoxPopuli, and six accented language variants (Hindi, Spanish, German, French, Mandarin, Japanese), with a 200-sample balanced subset for the reported model evaluation. The other four axes draw on task-specific corpora: 200 SLURP samples for intent and entity extraction, 200 MELD samples for 7-class emotion, 248 AMI-grounded questions for multi-turn reasoning, and a 200-sample MUSAN/WHAM noise-cliff at four SNR levels (20, 10, 5, 0 dB). We evaluate twelve frontier systems organized into three classes:

01Dedicated ASR (audio in, transcript out). Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
02Audio-native multimodal LLMs (audio in, task-conditioned output). GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
03Text reasoners over reference transcripts (text in, text out). Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, used as an upper-bound control that isolates reasoning quality from audio understanding.

Primary metrics are WER and CER on transcription (95% bootstrap confidence intervals, 10,000 resamples), weighted F1 on intent and emotion, set-F1 on entities, token-level F1 on AMI reasoning, and WER-under-noise deltas on the MUSAN/WHAM cliff. On SLURP, dedicated ASR providers (which cannot emit intent labels directly) run an ASR→Claude-Haiku cascade so the comparison is against what a provider-grade deployment would actually produce.

Headline results

Table 1 reports WER and CER for all twelve models on the 200-sample transcription subset, post the 2026-04-22 rate-limit backfill that recovered the audio-native MLLMs to full coverage. Within the audio-facing tiers, ElevenLabs Scribe leads on WER (0.408) and is matched by Gemini 2.5 Pro (0.411) at overlapping CIs — a point-estimate tie at the top. AssemblyAI Universal-2 leads on CER (0.125). The dedicated-ASR cluster spans only 0.024 WER from best to worst. Claude text reasoners on the reference transcript define a ceiling (WER 0.385–0.395) attributable to formatting, punctuation, and casing normalization — not audio. ElevenLabs Scribe and Gemini 2.5 Pro both land within 0.03 WER of that text-only ceiling.

Model	Class	WER	CER
Whisper v3	Dedicated ASR	0.432	0.137
Deepgram Nova-3	Dedicated ASR	0.429	0.144
Deepgram Nova-2	Dedicated ASR	0.432	0.144
AssemblyAI Univ.-2	Dedicated ASR	0.427	0.125
ElevenLabs Scribe	Dedicated ASR	0.408	0.132
GPT-4o Audio	Audio MLLM	0.612	0.308
GPT-4o-mini Audio	Audio MLLM	0.611	0.334
Gemini 2.5 Pro	Audio MLLM	0.411	0.132
Gemini 2.5 Flash	Audio MLLM	0.429	0.143
Claude Opus 4.5	Text control	0.391	0.116
Claude Sonnet 4.5	Text control	0.395	0.128
Claude Haiku 4.5	Text control	0.385	0.106

Table 1. VoicePro-Bench transcription results, lower-is-better. WER and CER on a 200-sample evaluation subset across English and six accented language variants.

Figure 1

WER

0.0 to 1.0

ElevenLabs Scribe

0.41

Gemini 2.5 Pro

0.41

AssemblyAI Univ.-2

0.43

Deepgram Nova-3

0.43

Gemini 2.5 Flash

0.43

Whisper v3

0.43

Deepgram Nova-2

0.43

GPT-4o-mini Audio

0.61

GPT-4o Audio

0.61

Claude Haiku 4.5 (text)

0.39

Claude Opus 4.5 (text)

0.39

Claude Sonnet 4.5 (text)

0.40

Figure 1. WER (lower is better) across all twelve evaluated systems, post the 2026-04-22 rate-limit backfill on the four audio-native MLLMs. ElevenLabs Scribe and Gemini 2.5 Pro are tied at the top of the audio-facing tier on point estimate (0.408 vs 0.411) with overlapping CIs; the rest of the dedicated-ASR tier clusters within 0.024 WER; GPT-4o Audio and its mini variant remain outliers.

Scribe and Gemini 2.5 Pro tie at the top

After the 2026-04-22 rate-limit backfill brought the audio-native MLLMs to full 200-sample coverage, Gemini 2.5 Pro (WER 0.411) essentially matches ElevenLabs Scribe (0.408) at the top of the transcription axis: the 95% bootstrap CIs are 0.366–0.458 and 0.365–0.454 respectively, a statistical tie. Gemini 2.5 Flash (0.429) sits mid-pack inside the dedicated-ASR cluster. GPT-4o Audio (0.612) and GPT-4o-mini Audio (0.611) remain dramatically worse. The simpler earlier story — that specialized ASR subsumes audio-native MLLMs on transcription — does not hold at full coverage on our 200-sample subset; one frontier MLLM has reached parity, while the rest of the audio-native tier still lags.

Within the dedicated-ASR cluster, WER spans only 0.024 from best (ElevenLabs 0.408) to worst (Whisper v3 / Deepgram Nova-2 at 0.432). At this scale, the choice between top-five production ASR providers should be made on cost, latency, language coverage, and reliability rather than headline WER alone. AssemblyAI Universal-2 has the lowest character error rate (0.125), suggesting it preserves character-level fidelity even when word-level errors are marginally higher.

On transcription the best frontier multimodal model has drawn level with the best dedicated ASR. On intent and reasoning, the text-reasoner control on the reference transcript still wins by double digits — audio understanding, not reasoning, remains the dominant source of error on today's downstream voice tasks.

Refusals, timeouts, and noise cliffs

Three deployment-visible failure modes show up in the axes beyond transcription. First, GPT-4o Audio refuses 20% of valid SLURP transcription requests (40 of 200), with benign user utterances like "play radio station 101.9" treated as actions the model cannot perform; GPT-4o-mini Audio refuses 26.5%. This is a prompt-following failure in audio mode, not an audio-understanding failure, and it would not appear on a leaderboard that filters non-responses.

Second, at full 200-sample coverage GPT-4o Audio's transcription WER is 0.612 with a 95% bootstrap CI of 0.464–0.795 — CI width 0.33, roughly 4× Gemini 2.5 Pro's 0.09. This is no longer a sample-size artifact; it reflects a genuinely bimodal distribution with a heavy CJK tail alongside a sizeable fraction of clean transcriptions. Third, under additive MUSAN/WHAM noise, Gemini 2.5 Flash posts a 20-to-0 dB WER jump of +0.073 — roughly 1.5× the steepest dedicated-ASR cliff (Deepgram Nova-2, +0.049) — with the degradation coming from faithful-transcription quality loss rather than refusal. Over 95% of Gemini 2.5 Flash's 0-dB outputs are still transcription attempts, so the cliff is a pure quality effect.

A separate integration cliff: before we passed explicit BCP-47 language codes to Deepgram Nova-3, its default multilingual auto-detect covered only a small set of Western and Indic languages and silently returned empty strings for Mandarin and Japanese. Passing an explicit code (zh, ja) brought Nova-3 back into line with the other dedicated ASR providers. This kind of integration detail is not surfaced in the default SDK examples and likely explains a non-trivial fraction of multilingual quality complaints from Deepgram customers.

Discussion

VoicePro-Bench's main result is that audio understanding, not reasoning, accounts for the downstream-task gap on today's frontier models. Claude Opus 4.5 as a text-reasoner control reaches SLURP intent F1 0.748 on the reference transcript; the best audio-native MLLM reaches 0.479. Claude Haiku 4.5 on AMI meeting transcripts reaches token-F1 0.636; GPT-4o Audio and Gemini 2.5 Pro on raw audio reach 0.509–0.512. Same questions, same answers — when the audio is replaced by the ground-truth transcript the gap closes.

On transcription the dedicated-ASR cluster is tight (0.024 WER spread) and now joined at the top by Gemini 2.5 Pro (0.411, statistical tie with ElevenLabs Scribe), so provider-selection decisions should turn on cost, latency, language coverage, and reliability rather than headline WER. On downstream tasks, the ASR→Claude cascade outperforms every audio-native MLLM we evaluated on SLURP intent and entity F1. Emotion (MELD 7-class) is not yet production-usable on any audio-native model: the best score is Gemini 2.5 Pro at F1 0.443, roughly 2–3× the random baseline but well below any usable routing signal.

We release the benchmark and evaluation harness so that other labs can reproduce these numbers. Per-axis sample sizes are informative for aggregate rankings but too small for per-accent or per-scenario claims; we do not make those claims in the paper body.

Cite this work

@article{datoric01,
  title={VoicePro-Bench: Evaluating frontier voice AI models on professional speech understanding},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.com/research/voicepro-bench}
}

Data sources

→FLEURS (en-US, accented variants)
→VoxPopuli
→SLURP
→MELD
→AMI Meeting Corpus
→MUSAN
→WHAM

№ 04Safety

VideoTruth-Bench

Cross-modal consistency verification across six contradiction levels

Read paper

№ 03Video

VidWork-Bench

A five-axis benchmark for procedural video understanding

Read paper