Why equity, not just accuracy
Public speech benchmarks typically report a single summary word error rate, aggregated across the languages the model happens to support. This obscures a critical property: the worst-served language is the one that matters most for a billion users. A model that averages a 3% word error rate but fails catastrophically on Javanese, Yoruba, or Amharic is not a multilingual model in any meaningful sense.
GlobalVoice-Bench reframes voice AI evaluation as a fairness problem. The headline metric is the ratio of word error rate between the worst and best resource tier, which we call the equity gap. A model with an equity gap of 1.0 serves every resource tier equally. A model with an equity gap of 5 serves high-resource speakers five times more accurately than low-resource speakers.
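The metric is a one-liner. A minimal sketch, using Whisper v3's tier WERs from the results table below (the function name is illustrative, not part of any released tooling):

```python
def equity_gap(high_wer: float, low_wer: float) -> float:
    """WER ratio between the low- and high-resource tiers.

    1.0 means both tiers are served equally; larger values mean
    low-resource speakers are served proportionally worse.
    """
    return low_wer / high_wer

# Whisper v3's tier WERs from the results table:
print(f"{equity_gap(high_wer=0.214, low_wer=0.529):.1f}x")  # -> 2.5x
```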
Headline finding
Three of the five dedicated ASR providers (Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2) cannot serve the low-resource tier: Deepgram returns HTTP 400 errors on Swahili, Amharic, Hausa, Yoruba, Igbo, and Javanese, and AssemblyAI's WER exceeds 0.85. ElevenLabs Scribe (WER 0.409) and Gemini 2.5 Pro (WER 0.413) are the only systems to produce broadly usable transcripts in this tier, statistically tied at the top.
The benchmark
GlobalVoice-Bench samples FLEURS audio across twenty languages organized into three resource tiers — seven high-resource (English, Mandarin, Spanish, French, German, Russian, Japanese), seven mid-resource, and six low-resource (predominantly African). We evaluate twelve frontier systems organized into three classes:
1. Dedicated ASR (audio in, transcript out): Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
2. Audio-native multimodal LLMs: GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
3. Text reasoners on reference transcripts (upper-bound control): Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5.
Evaluation is performed along four axes: per-language WER (CER for CJK), code-switch boundary accuracy, accent sensitivity variance, and cultural comprehension QA. All scores carry 95% bootstrap confidence intervals.
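Both headline pieces are standard machinery. A minimal sketch of word-level WER and a percentile bootstrap CI, assuming per-utterance scoring (all names here are illustrative, not the benchmark's actual harness):

```python
import random

def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein edit distance over word tokens,
    normalized by reference length (can exceed 1.0 with insertions)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[-1][-1] / max(len(r), 1)

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI over per-utterance scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])
```

CER for CJK is the same edit-distance computation over characters instead of whitespace-split words.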
The equity gap
Figure: equity gap per model (low-resource WER ÷ high-resource WER). Whisper v3 2.5×; Deepgram Nova-3 unsupported; Deepgram Nova-2 unsupported; AssemblyAI Universal-2 4.3×; ElevenLabs Scribe 1.8×; GPT-4o Audio 3.8×; GPT-4o-mini Audio 2.8×; Gemini 2.5 Pro 1.6×; Gemini 2.5 Flash 2.1×.
| Model | High WER | Mid WER | Low WER |
|---|---|---|---|
| Whisper v3 | 0.214 | 0.232 | 0.529 |
| Deepgram Nova-3 | 0.210 | 0.228 | unsup. |
| Deepgram Nova-2 | 0.213 | 0.246 | unsup. |
| AssemblyAI Univ.-2 | 0.201 | 0.217 | 0.856 |
| ElevenLabs Scribe | 0.230 | 0.217 | 0.409 |
| GPT-4o Audio | 0.742 | 0.456 | 2.829 |
| GPT-4o-mini Audio | 0.838 | 0.574 | 2.377 |
| Gemini 2.5 Pro | 0.257 | 0.198 | 0.413 |
| Gemini 2.5 Flash | 0.239 | 0.208 | 0.510 |
The low-resource penalty is large and consistent: every model that supports the low tier degrades there, and even Gemini 2.5 Pro, the most equitable audio-native MLLM, shows roughly a 1.6× WER increase from high to low resource. For dedicated ASR, four of the five providers cluster within 0.013 WER of each other on high-resource (0.201–0.214), so the choice between them is essentially a wash at the top of the resource hierarchy. Differences become decision-relevant only as we descend tiers.
Where models fail by language
GPT-4o Audio averages WER above 1.0 on low-resource (2.829) — the model effectively hallucinates more text than the reference contains. Its mini variant is similarly catastrophic. Gemini 2.5 Pro and Flash, in contrast, track dedicated ASR closely on high- and mid-resource. On low-resource, ElevenLabs Scribe (0.409) and Gemini 2.5 Pro (0.413) are statistically tied at the top. The audio-native MLLM category is not monolithic: provider-level differences inside the category are larger than the gap between the category and dedicated ASR.
Code-switching is a separate failure regime. Even Gemini 2.5 Pro produces a boundary WER of 35.7% in a ±50-character window around switch points, roughly 1.4× its aggregate high-resource monolingual WER of 0.257. The dominant failure mode is language collapse: the model ignores the switch and transcribes the entire boundary region in a single language; this accounts for 31–44% of boundary errors across models.
| Model | Boundary WER ↓ | Lang ID Acc ↑ |
|---|---|---|
| Gemini 2.5 Pro | 0.357 | 71.2% |
| GPT-4o Audio | 0.382 | 67.3% |
| Gemini 2.5 Flash | 0.415 | 62.8% |
| Whisper v3 | 0.429 | 58.4% |
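The windowing itself is simple under our reading of the setup: switch-point character offsets come from the reference annotations, and boundary WER is scored only inside each slice (names and the example utterance below are illustrative):

```python
def boundary_window(text: str, switch_char: int, radius: int = 50) -> str:
    """±radius-character slice of a transcript around an annotated
    code-switch point; boundary WER is then scored on this slice only."""
    return text[max(0, switch_char - radius): switch_char + radius]

# Hypothetical Swahili/English utterance with a switch at character 16:
utt = "nitakuja kesho, but the meeting might run late"
print(boundary_window(utt, 16, radius=10))  # -> "ja kesho, but the me"
```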
Accent variance and cultural reasoning
Within-language accent variance reveals that models treat different accents of the same language very differently. For English, WER standard deviation across accents ranges from 3.2 percentage points (GPT-4o Audio) to 8.7 (GPT-4o-mini Audio). For Arabic, accent variance is even larger: WER range (max minus min across accents) exceeds 15 percentage points for most models. A single summary WER hides this internal dispersion.
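Both dispersion statistics are cheap to compute once per-accent WER is in hand; a sketch with made-up accent labels and values (not the benchmark's actual numbers):

```python
from statistics import pstdev

# Hypothetical per-accent English WERs for a single model:
accent_wer = {"en-US": 0.18, "en-GB": 0.19, "en-IN": 0.27,
              "en-NG": 0.31, "en-AU": 0.20}

wers = list(accent_wer.values())
sd = pstdev(wers)               # std dev across accents
spread = max(wers) - min(wers)  # the max-minus-min range used above
print(f"sd={sd * 100:.1f}pp range={spread * 100:.1f}pp")  # -> sd=5.1pp range=13.0pp
```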
Transcription is the floor. Cultural reasoning is the ceiling. Models today are close to the floor in high-resource settings, blocked by HTTP 400 errors at the floor in low-resource settings, and far from the ceiling everywhere.
Implications
The linguistic equity gap is a training data story. The models we evaluated are not architecturally incapable of serving low-resource languages. They are incapable because the data they were trained on does not represent those languages fairly. Closing the gap requires intentional data work: collecting and curating speech and text from the languages that are currently underserved, and pairing that with cultural context annotations by native speakers.
GlobalVoice-Bench is the first step in a longer measurement program. Future iterations will broaden the code-switched utterance set, balance speakers by age and gender, and add domain-specific terminology. The goal is to make equity gaps visible enough that they become unavoidable.