Motivation
Voice models are increasingly deployed in call centers, clinical intake, legal discovery, and enterprise meeting tools. These contexts share three properties that public benchmarks rarely capture: noisy environments, accented and non-native speakers, and dense domain-specific jargon. Standard speech evaluations like LibriSpeech test-clean or Common Voice fluent splits do not measure whether a model can extract the correct intent from a frustrated customer speaking Spanish-inflected English about a medical billing issue.
VoicePro-Bench is our attempt to measure what professional deployment actually demands. Rather than optimizing a single summary score, we structure the benchmark around four independent axes that stress different capabilities, and we report every result with bootstrap confidence intervals so that claims of model superiority are statistically defensible.
Headline finding
ElevenLabs Scribe (WER 0.408) is the strongest system in the audio-facing tiers, beating every audio-native multimodal LLM including Gemini 2.5 Pro (0.441). AssemblyAI Universal-2 has the lowest CER (0.125) among dedicated ASR providers. GPT-4o Audio is the clear outlier, with WER 0.624 and a ~20% timeout rate on CJK inputs.
The benchmark
VoicePro-Bench comprises curated audio samples drawn from FLEURS English (en-US) and VoxPopuli, plus six accented language variants (Hindi, Spanish, German, French, Mandarin, Japanese). We evaluate twelve frontier systems organized into three classes:
- Dedicated ASR (audio in, transcript out). Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
- Audio-native multimodal LLMs (audio in, task-conditioned output). GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
- Text reasoners over reference transcripts (text in, text out). Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, used as an upper-bound control measuring formatting and normalization cost rather than audio understanding.
Primary metrics are word error rate (WER) and character error rate (CER), reported with 95% bootstrap confidence intervals over 10,000 resamples. We additionally report a 500-error manual taxonomy and a per-difficulty-tier breakdown.
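The scoring reduces to token-level edit distance plus resampling. A minimal pure-Python sketch of corpus-level WER/CER with a percentile-bootstrap confidence interval; function names are illustrative, not the harness's actual API:

```python
import random

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / max(len(r), 1)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def bootstrap_wer_ci(pairs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on corpus WER, resampling (ref, hyp) pairs.

    Corpus WER pools errors over words (sum of errors / sum of reference
    words) rather than averaging per-utterance WER.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        errors = sum(edit_distance(r.split(), h.split()) for r, h in sample)
        words = sum(len(r.split()) for r, _ in sample)
        scores.append(errors / max(words, 1))
    scores.sort()
    return (scores[int(alpha / 2 * n_resamples)],
            scores[int((1 - alpha / 2) * n_resamples) - 1])
```

Pooling errors over words before dividing is what makes the interval a corpus-level statistic; averaging per-utterance WER would over-weight short clips.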
Headline results
Table 1 reports WER and CER for all twelve models. Within the audio-facing tiers, ElevenLabs Scribe leads on WER (0.408) and AssemblyAI Universal-2 leads on CER (0.125). The dedicated-ASR cluster spans only 0.024 WER from best to worst. Claude text reasoners on the unnormalized reference transcript define a floor (WER 0.385–0.395) attributable to formatting, punctuation, and casing. ElevenLabs Scribe lands within ~0.02 WER of that text-only floor.
| Model | Class | WER | CER |
|---|---|---|---|
| Whisper v3 | Dedicated ASR | 0.432 | 0.137 |
| Deepgram Nova-3 | Dedicated ASR | 0.429 | 0.144 |
| Deepgram Nova-2 | Dedicated ASR | 0.432 | 0.144 |
| AssemblyAI Univ.-2 | Dedicated ASR | 0.427 | 0.125 |
| ElevenLabs Scribe | Dedicated ASR | 0.408 | 0.132 |
| GPT-4o Audio | Audio MLLM | 0.624 | 0.453 |
| GPT-4o-mini Audio | Audio MLLM | 0.611 | 0.334 |
| Gemini 2.5 Pro | Audio MLLM | 0.441 | 0.138 |
| Gemini 2.5 Flash | Audio MLLM | 0.436 | 0.138 |
| Claude Opus 4.5 | Text control | 0.391 | 0.116 |
| Claude Sonnet 4.5 | Text control | 0.395 | 0.128 |
| Claude Haiku 4.5 | Text control | 0.385 | 0.106 |
Figure 1: WER by model (0.0–1.0 scale).
Dedicated ASR still leads
ElevenLabs Scribe (WER 0.408) beats every audio-native multimodal LLM in our evaluation. Gemini 2.5 Flash (0.436) and Gemini 2.5 Pro (0.441) are competitive with the dedicated-ASR cluster but do not surpass the best production ASR. GPT-4o Audio (0.624) and GPT-4o-mini Audio (0.611) are dramatically worse. The narrative that frontier multimodal models subsume specialized ASR does not hold at production-evaluation scale.
Within the dedicated-ASR cluster, WER spans only 0.024 from best (ElevenLabs 0.408) to worst (Whisper v3 / Deepgram Nova-2 at 0.432). At this scale, the choice between top-five production ASR providers should be made on cost, latency, language coverage, and reliability rather than headline WER alone. AssemblyAI Universal-2 has the lowest character error rate (0.125), suggesting it preserves character-level fidelity even when word-level errors are marginally higher.
At production-evaluation scale, the best frontier multimodal model does not beat the best dedicated ASR, and membership in the audio-native MLLM category carries no useful quality prior in either direction.
GPT-4o Audio's reliability tail
GPT-4o Audio is the worst audio-native model in our evaluation on both quality (WER 0.624) and reliability. Its confidence interval is extremely wide (0.344–1.098), reflecting catastrophic failures on a subset of samples. Across the full run we observed roughly 20% of CJK samples returning timeout or connection errors after the default 60-second client timeout. No dedicated ASR provider and no other audio-native model exhibits this failure pattern at comparable rates. For customers evaluating voice AI for production, this reliability tail is a more actionable signal than the mean-quality score.
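Surfacing this tail requires recording timeouts as first-class outcomes rather than silently dropping samples. A minimal sketch of how a harness can do that; `transcribe` is a stand-in for any provider call, and the 60-second default mirrors the client timeout described above:

```python
import concurrent.futures

# Shared pool: a timed-out call still occupies a worker thread until the
# provider client gives up, so size the pool above expected concurrency.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def run_with_timeout(transcribe, audio_path, timeout_s=60.0):
    """Return ("ok", transcript), ("timeout", path), or ("error", detail)."""
    future = _pool.submit(transcribe, audio_path)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return ("timeout", audio_path)
    except Exception as exc:
        return ("error", repr(exc))

def reliability_report(results):
    """Share of each non-ok status across a run, e.g. {"timeout": 0.2}."""
    counts = {}
    for status, _ in results:
        counts[status] = counts.get(status, 0) + 1
    return {s: c / len(results) for s, c in counts.items() if s != "ok"}
```

Reporting the non-ok share alongside WER is what turns "the mean looks fine" into the actionable signal described above.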
We also surfaced a hidden integration cliff: before fixing our provider adapter, Deepgram Nova-3 returned WER 0.73 because its default multilingual auto-detect ("multi") covers only a small set of Western and Indic languages, silently returning empty strings for Mandarin and Japanese audio. Passing an explicit BCP-47 language code (e.g., zh for Mandarin) brought Nova-3 down to WER 0.44 and Chinese CER to 0.11. This kind of integration detail is not surfaced in the default SDK examples and likely explains a non-trivial fraction of multilingual quality complaints from existing Deepgram customers.
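The fix amounts to one request parameter. A hedged sketch against Deepgram's `/v1/listen` REST endpoint with its `model` and `language` query parameters; the locale-to-code mapping is our own illustrative table, not part of any SDK:

```python
from urllib.parse import urlencode

# Illustrative mapping from dataset locales to explicit BCP-47 codes.
# Falling back to "multi" reproduces the silent-empty-string failure
# for Mandarin and Japanese described above.
LANGUAGE_BY_LOCALE = {
    "cmn-Hans": "zh",   # Mandarin
    "ja-JP": "ja",      # Japanese
    "hi-IN": "hi",
    "en-US": "en",
}

def listen_url(locale, model="nova-3",
               base="https://api.deepgram.com/v1/listen"):
    """Build a transcription URL with an explicit language code."""
    params = {"model": model,
              "language": LANGUAGE_BY_LOCALE.get(locale, "multi")}
    return f"{base}?{urlencode(params)}"
```

Pinning the language per sample, rather than trusting auto-detect, is the adapter change that moved Nova-3 from WER 0.73 to 0.44.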
Where models break
We manually categorized 500 errors from the five worst-performing model–condition pairs to build a taxonomy of professional voice AI failures. The top three categories together account for 68% of errors, and share a common mechanism: ambiguous acoustic input combined with strong language-model priors that favor common words over domain terminology.
| Failure mode | Share | Example |
|---|---|---|
| Domain term hallucination | 34% | "metformin" → "met for men" |
| Noise-induced intent flip | 18% | "cancel" → "can sell" |
| Accent entity corruption | 16% | Account digits transposed |
| Emotion misclassification | 14% | Urgency classified as neutral |
| Temporal context loss | 10% | Multi-turn reference failure |
| Speaker confusion | 8% | Speech attributed to wrong speaker |
Domain term hallucination is the dominant pathology. Models substitute familiar high-frequency words for unfamiliar technical terms when the acoustic evidence is ambiguous: "bilateral pneumothorax" becomes "by lateral new motor acts," "habeas corpus" becomes "have his corpse," "amortization schedule" becomes "a more decision schedule." These errors are particularly dangerous because the output is fluent and surface-level confident — automated grammaticality checks would not catch them.
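Because grammaticality checks pass these fluent substitutions, a cheap complementary guard is to verify that expected domain terms actually survived transcription. A minimal sketch, with an illustrative glossary and threshold:

```python
import re

def missing_terms(transcript, glossary):
    """Glossary terms absent from the transcript (case- and boundary-aware)."""
    text = transcript.lower()
    return [t for t in glossary
            if not re.search(r"\b" + re.escape(t.lower()) + r"\b", text)]

def flag_for_review(transcript, glossary, max_missing=0):
    """Flag a transcript whose expected domain terms did not survive ASR."""
    return len(missing_terms(transcript, glossary)) > max_missing
```

In a call-center or clinical pipeline the glossary would come from the account's own terminology list; the point is that the check targets content fidelity, not surface fluency.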
Discussion
VoicePro-Bench's headline result is that specialized dedicated ASR still wins on production voice transcription. ElevenLabs Scribe beats every audio-native multimodal LLM. The dedicated-ASR cluster is tight enough (0.024 WER spread) that buying decisions should turn on cost, latency, language coverage, and reliability rather than headline WER. GPT-4o Audio is structurally worse than every dedicated ASR and should be treated as a reliability risk in production.
We are releasing the benchmark and evaluation harness so that other labs can reproduce these numbers and extend the axes. The next iteration will add longitudinal callers, code-switched utterances, and clinical recordings under patient consent.