Paper № 02 · Fairness · 12 min read

GlobalVoice-Bench

Measuring linguistic equity gaps in multilingual speech AI

Published

March 10, 2026

Authors

Datoric Research

Dataset

FLEURS samples across 20 languages, 3 resource tiers. 12 frontier systems.

Abstract

We quantify the linguistic equity gap in frontier voice AI across twelve systems: five dedicated ASR providers (Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, ElevenLabs Scribe), four audio-native multimodal LLMs (GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, Gemini 2.5 Flash), and three Claude text-reasoner controls (Opus 4.5, Sonnet 4.5, Haiku 4.5). Three of five dedicated ASR providers — Deepgram Nova-3, Deepgram Nova-2, and AssemblyAI Universal-2 — do not produce usable transcripts for low-resource African languages (Swahili, Amharic, Hausa, Yoruba, Igbo, Javanese); Deepgram returns HTTP 400 errors and AssemblyAI returns WER above 0.85. Only ElevenLabs Scribe (WER 0.409) and Gemini 2.5 Pro (WER 0.413) work in this tier, statistically tied at the top. GPT-4o Audio averages WER above 1.0 on low-resource, hallucinating more text than the reference contains.

Headline

3 of 5 dedicated ASR providers cannot serve low-resource African languages.

Why equity, not just accuracy

Public speech benchmarks typically report a single summary word error rate, aggregated across the languages the model happens to support. This obscures a critical property: the worst-served language is the one that matters most for a billion users. A model that averages a 3% word error rate but fails catastrophically on Javanese, Yoruba, or Amharic is not a multilingual model in any meaningful sense.

GlobalVoice-Bench reframes voice AI evaluation as a fairness problem. The headline metric is the ratio of word error rate between the worst and best resource tier, which we call the equity gap. A model with an equity gap of 1.0 serves every resource tier equally. A model with an equity gap of 5 serves high-resource speakers five times more accurately than low-resource speakers.
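The equity gap is a simple ratio over per-tier error rates. A minimal sketch (the function name and dict shape are illustrative, not taken from the benchmark's code):

```python
def equity_gap(tier_wer: dict) -> float:
    """Ratio of the worst (highest-WER) tier to the best (lowest-WER) tier.

    1.0 means every resource tier is served equally; larger values mean
    the worst-served tier is proportionally worse off.
    """
    return max(tier_wer.values()) / min(tier_wer.values())

# Whisper v3's per-tier WERs (high / mid / low):
gap = equity_gap({"high": 0.214, "mid": 0.232, "low": 0.529})
print(round(gap, 2))  # 2.47, reported as "gap 2.5x"
```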

Headline finding

Three of five dedicated ASR providers (Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2) cannot serve low-resource African languages: Deepgram returns HTTP 400 errors on Swahili, Amharic, Hausa, Yoruba, Igbo, and Javanese, and AssemblyAI's WER exceeds 0.85. Only ElevenLabs Scribe (WER 0.409) and Gemini 2.5 Pro (WER 0.413) produce usable transcripts in this tier, statistically tied at the top.

The benchmark

GlobalVoice-Bench samples FLEURS audio across twenty languages organized into three resource tiers — seven high-resource (English, Mandarin, Spanish, French, German, Russian, Japanese), seven mid-resource, and six low-resource (predominantly African). We evaluate twelve frontier systems organized into three classes:

  1. Dedicated ASR (audio in, transcript out): Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
  2. Audio-native multimodal LLMs: GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
  3. Text reasoners on reference transcripts (upper-bound control): Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5.

Evaluation is performed along four axes: per-language WER (CER for CJK), code-switch boundary accuracy, accent sensitivity variance, and cultural comprehension QA. All scores carry 95% bootstrap confidence intervals.
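The core metric and its intervals can be sketched as follows. This is a generic Levenshtein-based error rate and a standard percentile bootstrap, not the benchmark's published code; the resample count and seed are assumptions.

```python
import random

def error_rate(ref: list, hyp: list) -> float:
    """Levenshtein distance over tokens, normalized by reference length.

    Word tokens give WER; character tokens give CER (used for CJK).
    Can exceed 1.0 when the hypothesis hallucinates extra tokens.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[-1] + 1,              # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the mean of per-utterance scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])

wer = error_rate("the cat sat".split(), "the cat sat down".split())  # 1 insertion / 3 words
cer = error_rate(list("你好世界"), list("你好地界"))                   # 1 substitution / 4 chars
```

A hypothesis much longer than the reference drives the rate past 1.0, which is exactly the hallucination regime reported for GPT-4o Audio below.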

The equity gap

[Figure: grouped WER bars per model and resource tier, annotated with per-model equity gaps: Whisper v3 2.5×, AssemblyAI Universal-2 4.3×, ElevenLabs Scribe 1.8×, GPT-4o Audio 3.8×, GPT-4o-mini Audio 2.8×, Gemini 2.5 Pro 1.6×, Gemini 2.5 Flash 2.1×; Deepgram Nova-3 and Nova-2 unsupported at low resource]
Figure 1. WER by resource tier across the nine audio-facing systems (lower is better). Three dedicated ASR providers cannot serve low-resource languages at all; their low-resource bars are unsupported. GPT-4o Audio's bars exceed 1.0 because the model hallucinates more text than the reference contains.
Model                 High     Mid      Low
Whisper v3            0.214    0.232    0.529
Deepgram Nova-3       0.210    0.228    unsup.
Deepgram Nova-2       0.213    0.246    unsup.
AssemblyAI Univ.-2    0.201    0.217    0.856
ElevenLabs Scribe     0.230    0.217    0.409
GPT-4o Audio          0.742    0.456    2.829
GPT-4o-mini Audio     0.838    0.574    2.377
Gemini 2.5 Pro        0.257    0.198    0.413
Gemini 2.5 Flash      0.239    0.208    0.510
Table 1. Error rate by resource tier. WER for whitespace-tokenized languages, CER for CJK. Lower is better. "unsup." = unsupported (HTTP 400 or empty transcript). Claude text reasoners (control) are omitted; all three score under 0.02 on every tier because they receive the reference transcript.

The tier gap is large and consistent: even the audio-native MLLMs that do support all tiers degrade sharply at the bottom, with Gemini 2.5 Pro showing roughly a 1.6× WER increase from high to low resource. For dedicated ASR, four of the five systems (all but ElevenLabs Scribe) cluster within 0.013 WER of each other on high-resource (0.201–0.214), so the choice between them is essentially a wash at the top of the resource hierarchy. Differences become decision-relevant only as we descend tiers.

Where models fail by language

GPT-4o Audio averages WER above 1.0 on low-resource (2.829) — the model effectively hallucinates more text than the reference contains. Its mini variant is similarly catastrophic. Gemini 2.5 Pro and Flash, in contrast, track dedicated ASR closely on high- and mid-resource. On low-resource, ElevenLabs Scribe (0.409) and Gemini 2.5 Pro (0.413) are statistically tied at the top. The audio-native MLLM category is not monolithic: provider-level differences inside the category are larger than the gap between the category and dedicated ASR.

Code-switching is a separate failure regime. Even Gemini 2.5 Pro produces a boundary WER of 35.7% in a ±50-character window around switch points, more than 5× its monolingual high-resource WER. The dominant failure mode is language collapse: the model ignores the switch and transcribes the entire boundary region in a single language, accounting for 31–44% of boundary errors across models.

Model               Boundary WER ↓    Lang ID Acc ↑
Gemini 2.5 Pro      0.357             71.2%
GPT-4o Audio        0.382             67.3%
Gemini 2.5 Flash    0.415             62.8%
Whisper v3          0.429             58.4%
Table 2. Code-switching boundary performance. Boundary WER computed in ±50-character windows around switch points.
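The windowed metric can be sketched as below. How the benchmark aligns the switch offset into the hypothesis transcript is not specified, so the separate hyp_switch argument is an assumption of this sketch.

```python
def edit_distance(a: list, b: list) -> int:
    """Standard Levenshtein distance over token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[-1] + 1,              # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def boundary_wer(ref: str, hyp: str, ref_switch: int, hyp_switch: int,
                 radius: int = 50) -> float:
    """WER restricted to ±radius-character windows around a code-switch point."""
    ref_win = ref[max(0, ref_switch - radius): ref_switch + radius].split()
    hyp_win = hyp[max(0, hyp_switch - radius): hyp_switch + radius].split()
    return edit_distance(ref_win, hyp_win) / max(len(ref_win), 1)
```

Language collapse, the dominant failure mode above, shows up in this window as hypothesis tokens drawn entirely from one language while the reference window mixes two.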

Accent variance and cultural reasoning

Within-language accent variance reveals that models treat different accents of the same language very differently. For English, WER standard deviation across accents ranges from 3.2 percentage points (GPT-4o Audio) to 8.7 (GPT-4o-mini Audio). For Arabic, accent variance is even larger: WER range (max minus min across accents) exceeds 15 percentage points for most models. A single summary WER hides this internal dispersion.
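Both dispersion statistics used here (standard deviation and max-minus-min range across accents) are straightforward to compute. Whether the paper uses population or sample standard deviation is not stated; this sketch assumes population, and the example accent labels and values are hypothetical.

```python
from statistics import pstdev

def accent_dispersion(accent_wer_pp: dict) -> tuple:
    """Std dev and range of per-accent WER (percentage points) for one language."""
    vals = list(accent_wer_pp.values())
    return pstdev(vals), max(vals) - min(vals)

# Hypothetical per-accent English WERs in percentage points, for illustration only:
sd, spread = accent_dispersion({"US": 18.0, "UK": 20.5, "Nigerian": 27.0, "Indian": 23.5})
```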

[Figure: cross-accent WER standard deviation on English, in percentage points: GPT-4o Audio 3.2, Gemini 2.5 Pro 5.1, ElevenLabs Scribe 6.3, GPT-4o-mini Audio 8.7]
Figure 2. Cross-accent WER standard deviation on English (lower = more accent-robust). The spread between the most and least accent-stable model is itself nearly 3×.

Transcription is the floor. Cultural reasoning is the ceiling. Models today are close to the floor in high-resource settings, blocked by HTTP 400 errors at the floor in low-resource settings, and far from the ceiling everywhere.

Implications

The linguistic equity gap is a training data story. The models we evaluated are not architecturally incapable of serving low-resource languages. They are incapable because the data they were trained on does not represent those languages fairly. Closing the gap requires intentional data work: collecting and curating speech and text from the languages that are currently underserved, and pairing that with cultural context annotations by native speakers.

GlobalVoice-Bench is the first axis in a longer measurement program. Future iterations will add code-switched utterances, speaker age and gender balance, and domain-specific terminology. The goal is to make equity gaps visible enough that they become unavoidable.

Cite this work

@article{datoric02,
  title={GlobalVoice-Bench: Measuring linguistic equity gaps in multilingual speech AI},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.ai/research/globalvoice-bench}
}

Data sources

  • FLEURS
  • Native speaker annotators
