
VoicePro-Bench

Evaluating frontier voice AI models on professional speech understanding

Published

February 24, 2026

Authors

Datoric Research

Dataset

200-sample evaluation subset. 12 models. Bootstrap 95% CI.

Abstract

Frontier voice models are marketed for professional deployment, yet their failure modes on real-world speech remain poorly characterized. We introduce VoicePro-Bench, a 200-sample transcription evaluation spanning English and six accented language variants, run across twelve frontier systems: five dedicated ASR providers (Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, ElevenLabs Scribe), four audio-native multimodal LLMs (GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, Gemini 2.5 Flash), and three Claude text-reasoner controls (Opus 4.5, Sonnet 4.5, Haiku 4.5) evaluated over reference transcripts. ElevenLabs Scribe is the strongest dedicated ASR at WER 0.408 and beats every audio-native multimodal LLM, including Gemini 2.5 Pro (0.441); AssemblyAI Universal-2 has the lowest character error rate among ASR providers (CER 0.125). GPT-4o Audio is an outlier on reliability, with WER 0.624 and a ~20% timeout rate on CJK inputs. A pre-fix integration cliff in Deepgram Nova-3's multilingual auto-detect inflated WER from 0.44 to 0.73 by silently returning empty strings on Mandarin and Japanese.

Headline

0.408

best WER, achieved by ElevenLabs Scribe, beating every audio-native multimodal LLM

Motivation

Voice models are increasingly deployed in call centers, clinical intake, legal discovery, and enterprise meeting tools. These contexts share three properties that public benchmarks rarely capture: noisy environments, accented and non-native speakers, and dense domain-specific jargon. Standard speech evaluations like LibriSpeech test-clean or Common Voice fluent splits do not measure whether a model can extract the correct intent from a frustrated customer speaking Spanish-inflected English about a medical billing issue.

VoicePro-Bench is our attempt to measure what professional deployment actually demands. Rather than optimizing a single summary score, we structure the benchmark around four independent axes that stress different capabilities, and we report every result with bootstrap confidence intervals so that claims of model superiority are statistically defensible.

Headline finding

ElevenLabs Scribe (WER 0.408) is the strongest system in the audio-facing tiers, beating every audio-native multimodal LLM including Gemini 2.5 Pro (0.441). AssemblyAI Universal-2 has the lowest CER (0.125) among dedicated ASR providers. GPT-4o Audio is the exception, with WER 0.624 and a ~20% timeout rate on CJK inputs.

The benchmark

VoicePro-Bench comprises curated audio samples drawn from FLEURS English (en-US) and VoxPopuli, plus six accented language variants (Hindi, Spanish, German, French, Mandarin, Japanese). We evaluate twelve frontier systems organized into three classes:

  • Dedicated ASR (audio in, transcript out): Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
  • Audio-native multimodal LLMs (audio in, task-conditioned output): GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
  • Text reasoners over reference transcripts (text in, text out): Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, used as an upper-bound control measuring formatting and normalization cost rather than audio understanding.

Primary metrics are word error rate (WER) and character error rate (CER), reported with 95% bootstrap confidence intervals over 10,000 resamples. We additionally report a 500-error manual taxonomy and a per-difficulty-tier breakdown.
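The metric pipeline above can be sketched in a few lines. This is an illustrative implementation, not the released harness: `word_error_rate` computes word-level Levenshtein distance normalized by reference length, and `bootstrap_ci` is a plain percentile bootstrap over per-sample scores, matching the 10,000-resample, 95% setup described here.

```python
import random

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

def bootstrap_ci(per_sample_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over the mean of per-sample scores."""
    rng = random.Random(seed)
    n = len(per_sample_scores)
    means = sorted(
        sum(rng.choice(per_sample_scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

CER follows the same recipe with `list(text)` in place of `text.split()`; in practice a library such as jiwer handles normalization edge cases (casing, punctuation) that this sketch ignores.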

Headline results

Table 1 reports WER and CER for all twelve models. Within the audio-facing tiers, ElevenLabs Scribe leads on WER (0.408) and AssemblyAI Universal-2 leads on CER (0.125). The dedicated-ASR cluster spans only 0.024 WER from best to worst. Claude text reasoners on the unnormalized reference transcript define a floor (WER 0.385–0.395) attributable to formatting, punctuation, and casing. ElevenLabs Scribe lands within roughly 0.02 WER of that text-only floor.

Model               | Class         | WER   | CER
Whisper v3          | Dedicated ASR | 0.432 | 0.137
Deepgram Nova-3     | Dedicated ASR | 0.429 | 0.144
Deepgram Nova-2     | Dedicated ASR | 0.432 | 0.144
AssemblyAI Univ.-2  | Dedicated ASR | 0.427 | 0.125
ElevenLabs Scribe   | Dedicated ASR | 0.408 | 0.132
GPT-4o Audio        | Audio MLLM    | 0.624 | 0.453
GPT-4o-mini Audio   | Audio MLLM    | 0.611 | 0.334
Gemini 2.5 Pro      | Audio MLLM    | 0.441 | 0.138
Gemini 2.5 Flash    | Audio MLLM    | 0.436 | 0.138
Claude Opus 4.5     | Text control  | 0.391 | 0.116
Claude Sonnet 4.5   | Text control  | 0.395 | 0.128
Claude Haiku 4.5    | Text control  | 0.385 | 0.106
Table 1. VoicePro-Bench transcription results (lower is better). WER and CER on a 200-sample evaluation subset across English and six accented language variants.

Figure 1

[Bar chart: WER per system on a 0.0–1.0 scale; ranked values match Table 1.]

Figure 1. WER (lower is better) across all twelve evaluated systems. The dedicated ASR tier clusters within 0.024 WER; Gemini 2.5 Pro/Flash are competitive but do not surpass ElevenLabs Scribe; GPT-4o Audio is the outlier.

Dedicated ASR still leads

ElevenLabs Scribe (WER 0.408) beats every audio-native multimodal LLM in our evaluation. Gemini 2.5 Flash (0.436) and Gemini 2.5 Pro (0.441) are competitive with the dedicated-ASR cluster but do not surpass the best production ASR. GPT-4o Audio (0.624) and GPT-4o-mini Audio (0.611) are dramatically worse. The narrative that frontier multimodal models subsume specialized ASR does not hold at production-evaluation scale.

Within the dedicated-ASR cluster, WER spans only 0.024 from best (ElevenLabs 0.408) to worst (Whisper v3 / Deepgram Nova-2 at 0.432). At this scale, the choice between top-five production ASR providers should be made on cost, latency, language coverage, and reliability rather than headline WER alone. AssemblyAI Universal-2 has the lowest character error rate (0.125), suggesting it preserves character-level fidelity even when word-level errors are marginally higher.

At production evaluation scale, the best frontier multimodal model does not beat the best dedicated ASR. Membership in the audio-native MLLM category carries no useful quality prior in either direction: Gemini 2.5 is competitive with dedicated ASR, while GPT-4o Audio trails far behind.

GPT-4o Audio's reliability tail

GPT-4o Audio is the worst audio-native model in our evaluation on both quality (WER 0.624) and reliability. Its confidence interval is extremely wide (0.344–1.098), reflecting catastrophic failures on a subset of samples. Across the full run we observed roughly 20% of CJK samples returning timeout or connection errors after the default 60-second client timeout. No dedicated ASR provider and no other audio-native model exhibits this failure pattern at comparable rates. For customers evaluating voice AI for production, this reliability tail is a more actionable signal than the mean-quality score.
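A reliability tail like this is something an evaluation harness has to absorb rather than crash on. The sketch below is a hedged illustration of the pattern we used (names are hypothetical, not from the released harness): retry timeout and connection errors with exponential backoff, and on final failure return the error as data so the sample is scored as a hard failure instead of aborting the run.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0,
                 retryable=(TimeoutError, ConnectionError)):
    """Call fn(); retry on timeout/connection errors with exponential backoff.
    Returns (result, error) where exactly one is non-None, so the caller can
    record a hard failure for the sample instead of crashing the whole run."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), None
        except retryable as exc:
            if attempt == max_attempts:
                return None, exc          # surface the failure as data
            time.sleep(delay)             # back off before retrying
            delay *= 2
```

Samples that exhaust their retries are then counted in the ~20% CJK timeout figure above rather than silently dropped, which is what makes the reliability tail visible in the first place.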

We also surfaced a hidden integration cliff: before fixing our provider adapter, Deepgram Nova-3 returned WER 0.73 because its default multilingual auto-detect ("multi") covers only a small set of Western and Indic languages, silently returning empty strings for Mandarin and Japanese audio. Passing an explicit BCP-47 language code (e.g., zh for Mandarin) brought Nova-3 down to WER 0.44 and Chinese CER to 0.11. This kind of integration detail is not surfaced in the default SDK examples and likely explains a non-trivial fraction of multilingual quality complaints from existing Deepgram customers.
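The adapter fix amounts to mapping each sample's known language to an explicit code instead of trusting auto-detect. A minimal sketch, assuming the benchmark's language labels and Deepgram's public `/v1/listen` query parameters (`model`, `language`; verify names against current Deepgram docs before relying on them):

```python
# Benchmark language label -> explicit BCP-47 code. Falling back to "multi"
# only for labels we do not recognize avoids the silent-empty-string cliff
# on Mandarin and Japanese described above.
EXPLICIT_LANG = {
    "english": "en",
    "hindi": "hi",
    "spanish": "es",
    "german": "de",
    "french": "fr",
    "mandarin": "zh",
    "japanese": "ja",
}

def deepgram_params(sample_language: str, model: str = "nova-3") -> dict:
    """Query parameters for a Deepgram /v1/listen request (illustrative;
    parameter names per the public REST API, not the released harness)."""
    lang = EXPLICIT_LANG.get(sample_language.strip().lower(), "multi")
    return {"model": model, "language": lang}
```

Because the empty-string failure is silent, the adapter should also treat an empty transcript on non-empty audio as an error rather than a valid hypothesis.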

Where models break

We manually categorized 500 errors from the five worst-performing model–condition pairs to build a taxonomy of professional voice AI failures. The top three categories together account for 68% of errors, and share a common mechanism: ambiguous acoustic input combined with strong language-model priors that favor common words over domain terminology.

Failure mode              | Share | Example
Domain term hallucination | 34%   | "metformin" → "met for men"
Noise-induced intent flip | 18%   | "cancel" → "can sell"
Accent entity corruption  | 16%   | Account digits transposed
Emotion misclassification | 14%   | Urgency classified as neutral
Temporal context loss     | 10%   | Multi-turn reference failure
Speaker confusion         | 8%    | Speech attributed to wrong speaker
Table 2. Error taxonomy across 500 manually categorized errors.

Domain term hallucination is the dominant pathology. Models substitute familiar high-frequency words for unfamiliar technical terms when the acoustic evidence is ambiguous: "bilateral pneumothorax" becomes "by lateral new motor acts," "habeas corpus" becomes "have his corpse," "amortization schedule" becomes "a more decision schedule." These errors are particularly dangerous because the output is fluent and surface-level confident — automated grammaticality checks would not catch them.
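Because these errors evade grammaticality checks, a cheap guardrail is to score whether expected domain terms survive transcription intact. The helper below is an illustrative sketch (not part of the released harness): given a per-domain glossary, it reports the fraction of glossary terms present verbatim in the transcript, so fluent output with low term recall can be flagged for review.

```python
def domain_term_recall(transcript: str, glossary: list[str]) -> float:
    """Fraction of expected domain terms that appear verbatim (case-insensitive)
    in the transcript. Low recall on fluent output is a hallucination signal."""
    text = transcript.lower()
    if not glossary:
        return 1.0  # nothing to check
    hits = sum(1 for term in glossary if term.lower() in text)
    return hits / len(glossary)
```

For the "metformin" example above, a transcript containing "met for men" scores 0.0 against the glossary `["metformin"]`, even though every word in it is a common English word a language-model prior would happily accept.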

Discussion

VoicePro-Bench's headline result is that specialized dedicated ASR still wins on production voice transcription. ElevenLabs Scribe beats every audio-native multimodal LLM. The dedicated-ASR cluster is tight enough (0.024 WER spread) that buying decisions should turn on cost, latency, language coverage, and reliability rather than headline WER. GPT-4o Audio is structurally worse than every dedicated ASR and should be treated as a reliability risk in production.

We are releasing the benchmark and evaluation harness so that other labs can reproduce these numbers and extend the axes. The next iteration will add longitudinal callers, code-switched utterances, and clinical recordings under patient consent.

Cite this work

@article{datoric01,
  title={VoicePro-Bench: Evaluating frontier voice AI models on professional speech understanding},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.ai/research/voicepro-bench}
}

Data sources

  • FLEURS (en-US, accented variants)
  • VoxPopuli
