Why equity, not just accuracy
Public speech benchmarks typically report a single summary word error rate, aggregated across the languages the model happens to support. This obscures a critical property: the worst-served language is the one that matters most for a billion users. A model that averages a 3% word error rate but fails catastrophically on Javanese, Yoruba, or Amharic is not a multilingual model in any meaningful sense.
GlobalVoice-Bench reframes voice AI evaluation as a fairness problem. The headline metric is the ratio of word error rate between the worst and best resource tier, which we call the equity gap. A model with an equity gap of 1.0 serves every resource tier equally. A model with an equity gap of 5 serves high-resource speakers five times more accurately than low-resource speakers.
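The metric is a one-liner. A minimal sketch, using Whisper v3's tier WERs from the results table below (the function name is illustrative, not part of any released tooling):

```python
def equity_gap(high_wer: float, low_wer: float) -> float:
    """WER ratio between the low- and high-resource tiers.

    1.0 means both tiers are served equally; larger values mean
    low-resource speakers are served proportionally worse.
    """
    return low_wer / high_wer

# Whisper v3's tier WERs from the results table:
print(f"{equity_gap(high_wer=0.214, low_wer=0.529):.1f}x")  # -> 2.5x
```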
Headline finding
Three of the five dedicated ASR providers (Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2) cannot serve the low-resource tier: Deepgram returns HTTP 400 errors on Swahili, Amharic, Hausa, Yoruba, Igbo, and Javanese, and AssemblyAI's WER exceeds 0.85. ElevenLabs Scribe (WER 0.409) and Gemini 2.5 Pro (WER 0.413) are the only systems to produce broadly usable transcripts in this tier, statistically tied at the top.
The benchmark
GlobalVoice-Bench samples FLEURS audio across twenty languages organized into three resource tiers — seven high-resource (English, Mandarin, Spanish, French, German, Russian, Japanese), seven mid-resource, and six low-resource (predominantly African). We evaluate twelve frontier systems organized into three classes:
1. Dedicated ASR (audio in, transcript out): Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
2. Audio-native multimodal LLMs: GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
3. Text reasoners on reference transcripts (upper-bound control): Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5.
Evaluation is performed along four axes: per-language WER (CER for CJK), code-switch boundary accuracy, accent sensitivity variance, and cultural comprehension QA. All scores carry 95% bootstrap confidence intervals.
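Both headline pieces are standard machinery. A minimal sketch of word-level WER and a percentile bootstrap CI, assuming per-utterance scoring (all names here are illustrative, not the benchmark's actual harness):

```python
import random

def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein edit distance over word tokens,
    normalized by reference length (can exceed 1.0 with insertions)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[-1][-1] / max(len(r), 1)

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI over per-utterance scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])
```

CER for CJK is the same edit-distance computation over characters instead of whitespace-split words.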
The equity gap
Figure: equity gap per model (low-resource WER ÷ high-resource WER). Whisper v3 2.5×; Deepgram Nova-3 unsupported; Deepgram Nova-2 unsupported; AssemblyAI Universal-2 4.3×; ElevenLabs Scribe 1.8×; GPT-4o Audio 3.8×; GPT-4o-mini Audio 2.8×; Gemini 2.5 Pro 1.6×; Gemini 2.5 Flash 2.1×.
| Model | High WER | Mid WER | Low WER |
|---|---|---|---|
| Whisper v3 | 0.214 | 0.232 | 0.529 |
| Deepgram Nova-3 | 0.210 | 0.228 | unsup. |
| Deepgram Nova-2 | 0.213 | 0.246 | unsup. |
| AssemblyAI Univ.-2 | 0.201 | 0.217 | 0.856 |
| ElevenLabs Scribe | 0.230 | 0.217 | 0.409 |
| GPT-4o Audio | 0.742 | 0.456 | 2.829 |
| GPT-4o-mini Audio | 0.838 | 0.574 | 2.377 |
| Gemini 2.5 Pro | 0.257 | 0.198 | 0.413 |
| Gemini 2.5 Flash | 0.239 | 0.208 | 0.510 |
The low-resource penalty is large and consistent: every model that supports the low tier degrades there, and even Gemini 2.5 Pro, the most equitable audio-native MLLM, shows roughly a 1.6× WER increase from high to low resource. For dedicated ASR, four of the five providers cluster within 0.013 WER of each other on high-resource (0.201–0.214), so the choice between them is essentially a wash at the top of the resource hierarchy. Differences become decision-relevant only as we descend tiers.
Where models fail by language
GPT-4o Audio averages WER above 1.0 on low-resource (2.829) — the model effectively hallucinates more text than the reference contains. Its mini variant is similarly catastrophic. Gemini 2.5 Pro and Flash, in contrast, track dedicated ASR closely on high- and mid-resource. On low-resource, ElevenLabs Scribe (0.409) and Gemini 2.5 Pro (0.413) are statistically tied at the top. The audio-native MLLM category is not monolithic: provider-level differences inside the category are larger than the gap between the category and dedicated ASR.
Code-switching is a separate failure regime. Even Gemini 2.5 Pro produces a boundary WER of 35.7% in a ±50-character window around switch points, roughly 1.4× its aggregate high-resource monolingual WER of 0.257. The dominant failure mode is language collapse: the model ignores the switch and transcribes the entire boundary region in a single language; this accounts for 31–44% of boundary errors across models.
| Model | Boundary WER ↓ | Lang ID Acc ↑ |
|---|---|---|
| Gemini 2.5 Pro | 0.357 | 71.2% |
| GPT-4o Audio | 0.382 | 67.3% |
| Gemini 2.5 Flash | 0.415 | 62.8% |
| Whisper v3 | 0.429 | 58.4% |
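The windowing itself is simple under our reading of the setup: switch-point character offsets come from the reference annotations, and boundary WER is scored only inside each slice (names and the example utterance below are illustrative):

```python
def boundary_window(text: str, switch_char: int, radius: int = 50) -> str:
    """±radius-character slice of a transcript around an annotated
    code-switch point; boundary WER is then scored on this slice only."""
    return text[max(0, switch_char - radius): switch_char + radius]

# Hypothetical Swahili/English utterance with a switch at character 16:
utt = "nitakuja kesho, but the meeting might run late"
print(boundary_window(utt, 16, radius=10))  # -> "ja kesho, but the me"
```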
Accent variance and cultural reasoning
Within-language accent variance reveals that models treat different accents of the same language very differently. For English, WER standard deviation across accents ranges from 3.2 percentage points (GPT-4o Audio) to 8.7 (GPT-4o-mini Audio). For Arabic, accent variance is even larger: WER range (max minus min across accents) exceeds 15 percentage points for most models. A single summary WER hides this internal dispersion.
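Both dispersion statistics are cheap to compute once per-accent WER is in hand; a sketch with made-up accent labels and values (not the benchmark's actual numbers):

```python
from statistics import pstdev

# Hypothetical per-accent English WERs for a single model:
accent_wer = {"en-US": 0.18, "en-GB": 0.19, "en-IN": 0.27,
              "en-NG": 0.31, "en-AU": 0.20}

wers = list(accent_wer.values())
sd = pstdev(wers)               # std dev across accents
spread = max(wers) - min(wers)  # the max-minus-min range used above
print(f"sd={sd * 100:.1f}pp range={spread * 100:.1f}pp")  # -> sd=5.1pp range=13.0pp
```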
Transcription is the floor. Cultural reasoning is the ceiling. Models today are close to the floor in high-resource settings, blocked by HTTP 400 errors at the floor in low-resource settings, and far from the ceiling everywhere.
Implications
The linguistic equity gap is a training data story. The models we evaluated are not architecturally incapable of serving low-resource languages. They are incapable because the data they were trained on does not represent those languages fairly. Closing the gap requires intentional data work: collecting and curating speech and text from the languages that are currently underserved, and pairing that with cultural context annotations by native speakers.
GlobalVoice-Bench is the first step in a longer measurement program. Future iterations will broaden the code-switched utterance set, balance speakers by age and gender, and add domain-specific terminology. The goal is to make equity gaps visible enough that they become unavoidable.