Paper № 05 · Multilingual · 12 min read

BharatVoice-Bench

Independent benchmarking of frontier voice models on Indian languages

Published

April 16, 2026

Authors

Datoric Research

Dataset

160-sample stratified subset of an 11,487-sample curated corpus. Six frontier and Indic-specialist systems. 10 Indic languages plus Hindi-English code-switching. Bootstrap 95% CIs.

Abstract

India has 22 constitutionally scheduled languages and over 450 living languages, yet every public benchmark of frontier voice AI on Indic speech is either vendor-authored or restricted to a single capability axis. We introduce BharatVoice-Bench, the first independent, multi-axis evaluation of six frontier and Indic-specialist voice models — ElevenLabs Scribe v2, Deepgram Nova-3 Multilingual, Sarvam Saaras v3, GPT-4o Transcribe, GPT-4o-mini Transcribe, and AssemblyAI Universal-3 Pro — on 160 audio samples spanning 10 Indic languages plus Hindi-English code-switching. ElevenLabs Scribe v2 leads at WER 0.277, with Sarvam Saaras v3 (0.308) and Deepgram Nova-3 (0.350) statistically tied with each other behind it. Three findings stand out. First, only ElevenLabs Scribe v2 and Sarvam Saaras v3 cover all 10 target languages — Odia is silently rejected by OpenAI, Deepgram, and AssemblyAI; Deepgram additionally lacks Malayalam and Punjabi. Second, AssemblyAI Universal-3 Pro silently script-collapses 5 of the 9 Indic languages it accepts: it returns romanized Latin transcripts for Bengali, Gujarati, Malayalam, Punjabi, and Telugu, producing surface-plausible WER numbers whose failure is invisible without a Script Fidelity Rate metric. Third, public code-switch corpora for Indic-English pairs beyond Hindi are essentially nonexistent — even after mining 9,000 IndicVoices conversations, Tamil-English and Bengali-English samples remain near-zero in mid- and high-CMI buckets, a corpus-level gap that propagates into every model's evaluation.

Headline

5 of 9

Indic languages that AssemblyAI Universal-3 Pro silently collapses to romanized Latin script — invisible to WER alone

Why Indic, why now

Voice AI for Indian languages is simultaneously one of the largest commercial opportunities in contemporary speech technology — 22 official languages, roughly 1.4 billion speakers — and one of the least benchmarked. Frontier multimodal models from OpenAI, Google, and Anthropic, alongside specialized speech providers Deepgram, AssemblyAI, and ElevenLabs, are deployed globally and claim multilingual coverage. Indic-native specialists like Sarvam Saaras claim to beat the frontier on Indian languages. Yet these claims have been compared against each other only in vendor blog posts, never in an independent reproducible benchmark.

BharatVoice-Bench is the first independent measurement of frontier and Indic-specialist voice models on the same Indic test set with the same normalization. Three observations motivate it. The default Whisper-style normalizer strips the very characters that carry phonetic content in Brahmi-family scripts, understating true Indic WER by 21–152%. Transcription-quality metrics alone miss silent script collapse, as when a model returns Devanagari for Malayalam audio. And frontier-vs-specialist comparisons currently exist only as vendor self-reports.

Headline finding

ElevenLabs Scribe v2 leads at WER 0.277 over 10 Indic languages. AssemblyAI Universal-3 Pro silently script-collapses 5 of the 9 Indic languages it accepts, returning romanized Latin output that produces surface-plausible WER but is unusable downstream. Only ElevenLabs Scribe v2 and Sarvam Saaras v3 cover all 10 target languages.

The benchmark

BharatVoice-Bench evaluates six systems on a 160-sample stratified subset drawn from an 11,487-sample curated corpus of FLEURS, AI4Bharat IndicVoices, AI4Bharat Svarah, and HiACC. Every (model, language) cell carries 95% bootstrap confidence intervals over 10,000 resamples. References are normalized via the IndicNLP library — preserving matras and viramas that Whisper's default BasicTextNormalizer strips — before WER and CER are computed. The benchmark has three independent axes:

  • 01. Transcription fidelity. WER and CER per (model, language) with bootstrap CIs; pairwise significance tests on aggregate WER. (A scoring sketch appears at the end of this section.)
  • 02. Script Fidelity Rate (SFR). Fraction of output characters in the expected script for the target language. Cells below 0.5 indicate script collapse: the model is producing output dominantly in the wrong script. Invisible to WER.
  • 03. Code-switching. CMI-bucketed WER on Hindi-English, switch-point F1 on language-boundary prediction, and an LLM-as-judge Entity Preservation score (Claude Opus 4.6) that catches semantic drift WER misses.

We also expose a fourth implicit axis: API coverage. Several frontier providers silently return HTTP 400 errors for specific (provider, language) pairs that no model card documents. We treat the coverage matrix as a first-class result.
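To make the scoring pipeline concrete, the sketch below shows the normalize-then-score step for axis 01, assuming the open-source indic-nlp-library and jiwer packages. The helper name indic_wer_cer and the per-language normalizer lookup are illustrative; the benchmark's exact normalizer configuration may differ.

# Sketch of IndicNLP normalization followed by WER/CER scoring.
# Assumes the indic-nlp-library and jiwer packages; the benchmark's
# exact normalizer settings are not reproduced here.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
import jiwer

factory = IndicNormalizerFactory()

def indic_wer_cer(reference: str, hypothesis: str, lang: str = "hi"):
    """WER and CER after Indic-aware normalization that keeps matras and viramas."""
    normalizer = factory.get_normalizer(lang)  # ISO code, e.g. "hi", "bn", "ta"
    ref = normalizer.normalize(reference)
    hyp = normalizer.normalize(hypothesis)
    return jiwer.wer(ref, hyp), jiwer.cer(ref, hyp)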

The leaderboard

Table 1 reports overall WER and the composite score. ElevenLabs Scribe v2 leads at WER 0.277 (95% CI 0.244–0.311), achieving the lowest WER on 7 of 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 Multilingual (0.350) form a statistical cluster — paired bootstrap p = 0.38 between the two, meaning we cannot reliably rank them on a 160-sample test. The two OpenAI transcribe variants are similarly indistinguishable from each other (p = 0.30). AssemblyAI Universal-3 Pro is the structural outlier: an aggregate WER of 0.843 reflects not a quality issue alone but the script collapse problem we describe in the next section.
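For readers who want to reproduce the intervals and p-values quoted above, here is a minimal sketch of the paired-bootstrap machinery, assuming per-sample WERs for each model are available as equal-length arrays. The 10,000-resample count matches the text; the percentile CI and the sign-crossing two-sided p-value are standard constructions assumed here, not confirmed details of the benchmark.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(wers, n_boot=10_000, alpha=0.05):
    """Percentile CI on mean WER over bootstrap resamples of the test set."""
    wers = np.asarray(wers)
    idx = rng.integers(0, len(wers), size=(n_boot, len(wers)))
    return np.quantile(wers[idx].mean(axis=1), [alpha / 2, 1 - alpha / 2])

def paired_bootstrap_p(wers_a, wers_b, n_boot=10_000):
    """Two-sided p-value for the mean WER difference on the same samples."""
    diff = np.asarray(wers_a) - np.asarray(wers_b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot = diff[idx].mean(axis=1)
    # If the resampled mean difference rarely crosses zero, the gap is real.
    return 2 * min((boot <= 0).mean(), (boot >= 0).mean())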

Model                     WER ↓   CER ↓   SFR ↑   CS WER ↓   Composite ↑
ElevenLabs Scribe v2      0.277   0.115   0.964   0.323      0.472
Sarvam Saaras v3          0.308   0.131   0.996   0.444      0.383
Deepgram Nova-3 Multi.    0.350   0.158   0.957   0.325      0.420
GPT-4o Transcribe         0.408   0.222   0.998   0.547      0.268
GPT-4o-mini Transcribe    0.419   0.199   0.988   0.474      0.302
AssemblyAI Univ.-3 Pro    0.843   0.611   0.414   0.683      0.021
Table 1. BharatVoice-Bench leaderboard. WER and CER are IndicNLP-normalized; SFR averages over languages each model accepts; CS WER averages over Hindi-English code-switch buckets. Lower is better for WER, CER, and CS WER; higher is better for SFR.

Figure 1

[Bar chart: aggregate WER per system, scale 0.0–1.0. ElevenLabs Scribe v2 0.28; Sarvam Saaras v3 0.31; Deepgram Nova-3 0.35; GPT-4o Transcribe 0.41; GPT-4o-mini Transcribe 0.42; AssemblyAI Univ.-3 Pro 0.84.]

Figure 1. Aggregate WER (lower is better) across all six evaluated systems. ElevenLabs Scribe v2 leads; Sarvam Saaras v3 and Deepgram Nova-3 are statistically tied behind it; AssemblyAI Universal-3 Pro is the structural outlier driven by the script-collapse failure mode described below.

The script collapse problem

Script Fidelity Rate measures the fraction of output characters that are in the expected script for the target language. A high WER paired with a high SFR means the model is trying — making word-level errors but in the right alphabet. A low SFR means something more pathological is happening: the model is silently producing output in the wrong script entirely. WER cannot tell these apart, but customers care enormously about the difference. A romanized Latin transcript of Malayalam audio is unusable for downstream Indic NLP regardless of what its WER number says.
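A minimal sketch of how such a metric can be computed from Unicode block ranges follows; the block table is standard Unicode, while the filtering rule (count only alphabetic characters, ignore digits and punctuation) is an assumption about the benchmark's exact definition.

# Illustrative SFR via Unicode block ranges. The ranges are standard
# Unicode; counting only alphabetic characters is an assumed filtering rule.
SCRIPT_RANGES = {
    "hin": (0x0900, 0x097F),  # Devanagari
    "mar": (0x0900, 0x097F),  # Marathi also uses Devanagari
    "ben": (0x0980, 0x09FF),  # Bengali
    "pan": (0x0A00, 0x0A7F),  # Gurmukhi
    "guj": (0x0A80, 0x0AFF),  # Gujarati
    "ori": (0x0B00, 0x0B7F),  # Odia
    "tam": (0x0B80, 0x0BFF),  # Tamil
    "tel": (0x0C00, 0x0C7F),  # Telugu
    "kan": (0x0C80, 0x0CFF),  # Kannada
    "mal": (0x0D00, 0x0D7F),  # Malayalam
}

def script_fidelity_rate(text: str, lang: str) -> float:
    """Fraction of alphabetic characters inside the expected Unicode block."""
    lo, hi = SCRIPT_RANGES[lang]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(c) <= hi for c in letters) / len(letters)

On a fully romanized transcript of Bengali audio, every letter falls outside U+0980–U+09FF and the score is 0.0, which is exactly the AssemblyAI pattern in Table 2.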

AssemblyAI Universal-3 Pro is the cleanest example of script collapse in the wild. When asked to transcribe Bengali, Gujarati, Malayalam, Punjabi, or Telugu audio, the universal-3-pro endpoint silently falls back to its universal-2 predecessor, which lacks native Indic-script support and returns romanized Latin text. SFR for those five language cells is 0.0 — every sample is in the wrong script. WER on the same cells exceeds 1.0 because every romanized token counts as an error against an Indic-script reference. Without SFR, this looks like a quality problem; with SFR, it is unmistakably a coverage failure dressed up as transcription.

Model                     Ben    Guj    Hin    Kan    Mal    Mar    Ori    Pan    Tam    Tel
ElevenLabs Scribe v2      0.98   1.00   0.88   0.99   1.00   1.00   0.97   0.90   0.93   0.99
Sarvam Saaras v3          0.99   1.00   1.00   1.00   1.00   1.00   1.00   0.99   1.00   1.00
Deepgram Nova-3 Multi.    0.94   0.99   0.82   1.00   —      1.00   —      —      0.95   1.00
GPT-4o Transcribe         0.99   1.00   1.00   1.00   1.00   1.00   —      1.00   1.00
GPT-4o-mini Transcribe    0.99   0.99   0.93   1.00   1.00   1.00   —      1.00   1.00
AssemblyAI Univ.-3 Pro    0.00   0.00   0.81   0.91   0.00   1.00   —      0.00   1.00   0.00
Table 2. Script Fidelity Rate (SFR) per (model, language). Cells below 0.5 indicate script collapse — the dominant output script is not the expected target script. A dash indicates the model returned an HTTP 400 or similar refusal. The bottom row shows the AssemblyAI script-collapse pattern.

WER on a script-collapsed transcript is meaningless: every token is an error even if the phonetic content is preserved. SFR turns an invisible deployment-killing failure into a visible coverage finding.

Coverage as a deployment cliff

Across the six systems we evaluated, only ElevenLabs Scribe v2 and Sarvam Saaras v3 successfully transcribed all 10 Indic languages in our test set. Odia is rejected outright by GPT-4o Transcribe, GPT-4o-mini Transcribe, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Multilingual — the API returns HTTP 400 with no transcript. Deepgram additionally rejects Malayalam and Punjabi. None of these silent-rejection patterns is surfaced in vendor documentation. For a customer building an Odia-language voice product, discovering at integration time that four of six providers will simply refuse the audio is a high-stakes, deployment-defining surprise.

Coverage gaps disproportionately affect speakers of languages with less commercial buying power. Odia (~38 million native speakers), Punjabi (~125 million), and Malayalam (~35 million) are not small communities. ElevenLabs Scribe v2 and Sarvam Saaras v3 demonstrate that broad Indic coverage is feasible for motivated providers; the absence of coverage from the others reflects commercial neglect, not technical impossibility. We argue this coverage matrix belongs in every public speech-model card.

Code-switching: where the gap is largest

Hindi-English is the only code-switch pair for which we obtained sufficient samples across low, mid, and high CMI buckets. Even so, the per-bucket pattern is informative. ElevenLabs Scribe v2 and Deepgram Nova-3 remain relatively flat across CMI intensity (WER ≈ 0.18–0.30), indicating that their advertised code-switch handling is real on Hindi-English. GPT-4o-mini Transcribe shows a sharp WER jump from 0.19 (low CMI) to 0.59 (mid CMI) — a 3× degradation as English-token density increases inside a predominantly Hindi utterance. Sarvam Saaras v3, despite leading on monolingual Indic, climbs from 0.29 (low) to 0.64 (high) as code-mixing intensifies.
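For concreteness, here is an assumed sketch of CMI bucketing using the standard Das and Gambäck formulation; the low/mid/high thresholds below are illustrative placeholders, since the paper does not state its cut-offs here.

# Assumed sketch of Code-Mixing Index (CMI) bucketing.
# CMI = 100 * (1 - max_language_token_count / (n - language_independent)).
def cmi(lang_tags):
    tagged = [t for t in lang_tags if t != "other"]  # drop language-independent tokens
    if not tagged:
        return 0.0
    counts = {}
    for t in tagged:
        counts[t] = counts.get(t, 0) + 1
    return 100 * (1 - max(counts.values()) / len(tagged))

def cmi_bucket(score, low=15, high=30):
    """Illustrative thresholds; the benchmark's cut-offs are not stated here."""
    if score < low:
        return "low"
    return "mid" if score < high else "high"

For example, the tag sequence ["hi", "en", "en", "en", "hi"] gives CMI = 40, landing in the high bucket under these placeholder thresholds.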

Model                     Low CMI   Mid CMI   High CMI
Deepgram Nova-3 Multi.    0.273     0.177     0.199
ElevenLabs Scribe v2      0.275     0.300     0.222
Sarvam Saaras v3          0.287     0.505     0.638
GPT-4o Transcribe         0.428     0.454     0.488
GPT-4o-mini Transcribe    0.186     0.590     0.658
AssemblyAI Univ.-3 Pro    0.450     0.533     0.660
Table 3. Hindi-English code-switching WER by Code-Mixing Index bucket. The flat-curve providers are the ones whose vendor claims about code-switch handling hold up empirically; the GPT-4o-mini cliff between low and mid CMI is the most striking single finding.

Switch-point F1 (token-level language-boundary prediction) and Entity Preservation (LLM-as-judge fraction of named entities preserved through the transcript) reveal additional structure. ElevenLabs Scribe v2 leads switch-point F1 at 0.42, with Deepgram Nova-3 close behind at 0.40. GPT-4o-mini Transcribe and Sarvam Saaras v3 both score 0.0 on switch-point F1 — they do not preserve mixed-language token boundaries at all. Sarvam Saaras v3 also scores 0.0 on Entity Preservation, almost certainly because the Indic-only model transliterates English entities into Indic script rather than preserving them in Latin. Useful for monolingual Indic deployments; problematic for any pipeline that downstreams the entities to text-only systems.
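Since the exact switch-point F1 construction is not spelled out here, the sketch below shows one plausible reading, assuming token-aligned language tags for reference and hypothesis (a real evaluation needs an alignment step first): a switch point is any token index whose tag differs from its predecessor, and F1 compares the two switch-point sets.

# Assumed switch-point F1: compares predicted vs. reference switch-point
# index sets, given token-aligned language tags for both sides.
def switch_points(lang_tags):
    return {i for i in range(1, len(lang_tags)) if lang_tags[i] != lang_tags[i - 1]}

def switch_point_f1(ref_tags, hyp_tags):
    ref, hyp = switch_points(ref_tags), switch_points(hyp_tags)
    if not ref and not hyp:
        return 1.0  # no switches on either side counts as perfect agreement
    tp = len(ref & hyp)
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

Under this reading, a model that transliterates every English token into Indic script produces no tag changes at all and scores 0.0, consistent with the Sarvam Saaras v3 result.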

The harder finding is what is missing. Despite mining 9,000 IndicVoices conversational utterances with aggressive transliterated-English detection, we found only a few hundred Tamil-English and Bengali-English code-switch samples — almost all in the low-CMI bucket. Real-world Tanglish and Madras Bashai routinely sustain mid- and high-CMI mixing; the gap is in corpus curation, not in speaker behavior. We report this as a corpus-level scarcity that propagates into every model's code-switch evaluation on Indic-English pairs beyond Hindi.

Discussion

BharatVoice-Bench's headline result is that the strongest system on Indic speech is dedicated ASR — ElevenLabs Scribe v2 — and the strongest Indic specialist (Sarvam Saaras v3) is statistically tied with the strongest non-Indic dedicated ASR (Deepgram Nova-3) on aggregate WER. OpenAI's audio-native transcribe variants trail the leading dedicated ASR by 0.13–0.14 WER on Indic. The narrative that frontier multimodal models subsume specialized ASR does not hold on Indian languages.

The methodological contribution we care most about is Script Fidelity Rate. WER alone is not enough to measure Indic transcription quality: it conflates word-level errors with silent script collapse, and a romanized Latin transcript of Bengali audio is unusable downstream regardless of the WER number it reports. Vendors should publish SFR alongside WER. Customers should ask for it. We are releasing the benchmark, the per-(model, language) coverage matrix, the IndicNLP normalization, and the LLM-judge prompts so that other labs can reproduce and extend these numbers.

Cite this work

@article{datoric05,
  title={BharatVoice-Bench: Independent benchmarking of frontier voice models on Indian languages},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.ai/research/bharatvoice-bench}
}

Data sources

  • FLEURS (10 Indic configs)
  • AI4Bharat IndicVoices
  • AI4Bharat Svarah
  • HiACC
