Why Indic, why now
Voice AI for Indian languages is simultaneously one of the largest commercial opportunities in contemporary speech technology — 22 official languages, roughly 1.4 billion speakers — and one of the least benchmarked. Frontier multimodal models from OpenAI, Google, and Anthropic, alongside the specialized speech providers Deepgram, AssemblyAI, and ElevenLabs, are deployed globally and claim multilingual coverage. Indic-native specialists like Sarvam Saaras claim to beat the frontier on Indian languages. Yet these claims have been compared against each other only in vendor blog posts, never in an independent, reproducible benchmark.
BharatVoice-Bench is the first independent measurement of frontier and Indic-specialist voice models on the same Indic test set with the same normalization. Three observations motivate it. First, the default Whisper-style normalizer strips the very characters that carry phonetic content in Brahmi-family scripts, understating true Indic WER by 21–152%. Second, transcription quality alone misses silent script collapse, where a model returns Devanagari for Malayalam audio. Third, frontier-vs-specialist comparisons currently exist only as vendor self-reports.
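The normalizer problem is easy to reproduce. Whisper's BasicTextNormalizer replaces every character in the Unicode Mark, Symbol, and Punctuation categories with a space; in Brahmi-family scripts the vowel signs (matras) and the virama are combining marks, so they vanish. A minimal sketch that mimics that behavior (an approximation of the real normalizer, not a copy of it):

```python
import unicodedata

def basic_normalize_like_whisper(text: str) -> str:
    # Approximates Whisper's BasicTextNormalizer: any character whose
    # Unicode category starts with M (mark), S (symbol), or P
    # (punctuation) becomes a space; then lowercase and collapse spaces.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    return " ".join(cleaned.lower().split())

# "हिन्दी" = ह + ि + न + ् + द + ी; the matras and the virama are Marks,
# so the word degrades to the consonant skeleton "ह न द".
print(basic_normalize_like_whisper("हिन्दी"))
```

Because both reference and hypothesis are reduced to consonant skeletons before comparison, many real vowel and conjunct errors stop counting, which is how the normalizer hides a large slice of true Indic WER.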
Headline finding
ElevenLabs Scribe v2 leads at WER 0.277 over 10 Indic languages. AssemblyAI Universal-3 Pro silently script-collapses 5 of the 9 Indic languages it accepts, returning romanized Latin output that produces surface-plausible WER but is unusable downstream. Only ElevenLabs Scribe v2 and Sarvam Saaras v3 cover all 10 target languages.
The benchmark
BharatVoice-Bench evaluates six systems on a 160-sample stratified subset drawn from an 11,487-sample curated corpus spanning FLEURS, AI4Bharat IndicVoices, AI4Bharat Svarah, and HiACC. Every (model, language) cell carries 95% bootstrap confidence intervals over 10,000 resamples. References are normalized via the IndicNLP library — preserving the matras and viramas that Whisper's default BasicTextNormalizer strips — before WER and CER are computed. The benchmark has three independent axes:
1. Transcription fidelity. WER and CER per (model, language) with bootstrap CIs; pairwise significance tests on aggregate WER.
2. Script Fidelity Rate (SFR). Fraction of output characters in the expected script for the target language. Cells below 0.5 indicate script collapse: the model is producing output dominantly in the wrong script. Invisible to WER.
3. Code-switching. CMI-bucketed WER on Hindi-English, switch-point F1 on language-boundary prediction, and an LLM-as-judge Entity Preservation score (Claude Opus 4.6) that catches semantic drift WER misses.
We also expose a fourth, implicit axis: API coverage. Several frontier providers return bare HTTP 400 errors for specific (provider, language) pairs, a restriction no model card documents. We treat the coverage matrix as a first-class result.
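The per-cell confidence intervals are plain percentile bootstraps over sample-level WER. A sketch of the computation (function and parameter names are ours, not the benchmark code's):

```python
import random

def bootstrap_ci(per_sample_wer, n_resamples=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the per-sample WERs with
    # replacement, recompute the mean each time, and take the
    # alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    rng = random.Random(seed)
    n = len(per_sample_wer)
    means = sorted(
        sum(rng.choice(per_sample_wer) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return (
        means[int(n_resamples * alpha / 2)],
        means[int(n_resamples * (1 - alpha / 2)) - 1],
    )
```

On a 160-sample subset these intervals are wide, which is exactly why the leaderboard reports overlapping clusters rather than a strict ranking.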
The leaderboard
Table 1 reports overall WER and the composite score. ElevenLabs Scribe v2 leads at WER 0.277 (95% CI 0.244–0.311), achieving the lowest WER on 7 of 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 Multilingual (0.350) form a statistical cluster — paired bootstrap p = 0.38 between the two, meaning we cannot reliably rank them on a 160-sample test. The two OpenAI transcribe variants are similarly indistinguishable from each other (p = 0.30). AssemblyAI Universal-3 Pro is the structural outlier: an aggregate WER of 0.843 reflects not a quality issue alone but the script collapse problem we describe in the next section.
| Model | WER ↓ | CER ↓ | SFR ↑ | CS WER ↓ | Composite |
|---|---|---|---|---|---|
| ElevenLabs Scribe v2 | 0.277 | 0.115 | 0.964 | 0.323 | 0.472 |
| Sarvam Saaras v3 | 0.308 | 0.131 | 0.996 | 0.444 | 0.383 |
| Deepgram Nova-3 Multi. | 0.350 | 0.158 | 0.957 | 0.325 | 0.420 |
| GPT-4o Transcribe | 0.408 | 0.222 | 0.998 | 0.547 | 0.268 |
| GPT-4o-mini Transcribe | 0.419 | 0.199 | 0.988 | 0.474 | 0.302 |
| AssemblyAI Univ.-3 Pro | 0.843 | 0.611 | 0.414 | 0.683 | 0.021 |
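The pairwise p-values quoted above come from a paired bootstrap on per-sample WER differences. A sketch under our naming (two-sided, sign-flip counting; the benchmark's exact test may differ in detail):

```python
import random

def paired_bootstrap_p(wer_a, wer_b, n_resamples=10_000, seed=0):
    # Resample the paired per-sample differences with replacement; the
    # p-value is twice the fraction of resampled means whose sign
    # disagrees with the observed mean difference (capped at 1.0).
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(wer_a, wer_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    flips = sum(
        1
        for _ in range(n_resamples)
        if (sum(diffs[rng.randrange(n)] for _ in range(n)) / n) * observed <= 0
    )
    return min(1.0, 2 * flips / n_resamples)
```

A p-value like the 0.38 between Sarvam Saaras v3 and Deepgram Nova-3 means the sign of the difference flips in a large share of resamples, so the two cannot be reliably ordered at this sample size.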
Figure 1. Overall WER by model (scale 0.0–1.0).
The script collapse problem
Script Fidelity Rate measures the fraction of output characters that are in the expected script for the target language. A high WER paired with a high SFR means the model is trying — making word-level errors but in the right alphabet. A low SFR means something more pathological is happening: the model is silently producing output in the wrong script entirely. WER cannot tell these apart, but customers care enormously about the difference. A romanized Latin transcript of Malayalam audio is unusable for downstream Indic NLP regardless of what its WER number says.
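SFR itself is cheap to compute. A sketch that uses Unicode character names as the script test (the benchmark may use explicit codepoint ranges instead; the function name is ours):

```python
import unicodedata

def script_fidelity_rate(text: str, expected_script: str) -> float:
    # Fraction of alphabetic characters whose Unicode name starts with
    # the expected script name, e.g. "MALAYALAM LETTER NA". Combining
    # marks are not alphabetic, so only base letters are counted.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for c in letters if unicodedata.name(c, "").startswith(expected_script)
    )
    return hits / len(letters)

print(script_fidelity_rate("നമസ്കാരം", "MALAYALAM"))   # native script: 1.0
print(script_fidelity_rate("namaskaram", "MALAYALAM"))  # romanized: 0.0
```

The romanized case is the failure mode at issue: a perfectly fluent Latin transcript of Malayalam audio scores 0.0 here no matter how plausible its WER looks.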
AssemblyAI Universal-3 Pro is the cleanest example of script collapse in the wild. When asked to transcribe Bengali, Gujarati, Malayalam, Punjabi, or Telugu audio, the universal-3-pro endpoint silently falls back to its universal-2 predecessor, which lacks native Indic-script support and returns romanized Latin text. SFR for those five language cells is 0.0 — every sample is in the wrong script. WER on the same cells exceeds 1.0 because every romanized token counts as an error against an Indic-script reference. Without SFR, this looks like a quality problem; with SFR, it is unmistakably a coverage failure dressed up as transcription.
| Model | Ben | Guj | Hin | Kan | Mal | Mar | Ori | Pan | Tam | Tel |
|---|---|---|---|---|---|---|---|---|---|---|
| ElevenLabs Scribe v2 | 0.98 | 1.00 | 0.88 | 0.99 | 1.00 | 1.00 | 0.97 | 0.90 | 0.93 | 0.99 |
| Sarvam Saaras v3 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 |
| Deepgram Nova-3 Multi. | 0.94 | 0.99 | 0.82 | 1.00 | — | 1.00 | — | — | 0.95 | 1.00 |
| GPT-4o Transcribe | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | — | — | 1.00 | 1.00 |
| GPT-4o-mini Transcribe | 0.99 | 0.99 | 0.93 | 1.00 | 1.00 | 1.00 | — | — | 1.00 | 1.00 |
| AssemblyAI Univ.-3 Pro | 0.00 | 0.00 | 0.81 | 0.91 | 0.00 | 1.00 | — | 0.00 | 1.00 | 0.00 |
WER on a script-collapsed transcript is meaningless: every token is an error even if the phonetic content is preserved. SFR turns an invisible deployment-killing failure into a visible coverage finding.
Coverage as a deployment cliff
Across the six systems we evaluated, only ElevenLabs Scribe v2 and Sarvam Saaras v3 successfully transcribed all 10 Indic languages in our test set. Odia is rejected outright by GPT-4o Transcribe, GPT-4o-mini Transcribe, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Multilingual — the API returns HTTP 400 with no transcript. Deepgram additionally rejects Malayalam and Punjabi. None of these rejection patterns is surfaced in vendor documentation. For a customer building an Odia-language voice product, discovering at integration time that four of six providers will simply refuse the audio is a high-stakes, deployment-defining surprise.
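Coverage can be audited mechanically before any quality evaluation: call each provider once per language and record rejections instead of crashing. A sketch with stand-in transcribe callables (the provider clients here are placeholders, not real SDK calls):

```python
def coverage_matrix(providers, languages, probe_audio):
    # providers: name -> callable(audio_bytes, language_code) that
    # raises on rejection (e.g. an HTTP 400). Any exception is recorded
    # as a coverage gap rather than propagated.
    matrix = {}
    for name, transcribe in providers.items():
        for lang in languages:
            try:
                transcribe(probe_audio[lang], lang)
                matrix[(name, lang)] = True
            except Exception:
                matrix[(name, lang)] = False
    return matrix

# Demo with stand-in clients: "partial" rejects Odia ("or").
def _full(audio, lang):
    return "ok"

def _rejects_odia(audio, lang):
    if lang == "or":
        raise RuntimeError("HTTP 400: language not supported")
    return "ok"

matrix = coverage_matrix(
    {"full": _full, "partial": _rejects_odia},
    ["hi", "or"],
    {"hi": b"...", "or": b"..."},
)
```

Running this probe against real clients is how the coverage matrix above was assembled as a first-class result rather than discovered as an integration-time surprise.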
Coverage gaps disproportionately affect speakers of languages with less commercial buying power. Odia (~38 million native speakers), Punjabi (~125 million), and Malayalam (~35 million) are not small communities. ElevenLabs Scribe v2 and Sarvam Saaras v3 demonstrate that broad Indic coverage is feasible for motivated providers; the absence of coverage from the others reflects commercial neglect, not technical impossibility. We argue this coverage matrix belongs in every public speech-model card.
Code-switching: where the gap is largest
Hindi-English is the only code-switch pair for which we obtained sufficient samples across low, mid, and high CMI buckets. Even so, the per-bucket pattern is informative. ElevenLabs Scribe v2 and Deepgram Nova-3 remain relatively flat across CMI intensity (WER ≈ 0.18–0.30), indicating that their advertised code-switch handling is real on Hindi-English. GPT-4o-mini Transcribe shows a sharp WER jump from 0.19 (low CMI) to 0.59 (mid CMI) — a 3× degradation as English-token density increases inside a predominantly Hindi utterance. Sarvam Saaras v3, despite leading on monolingual Indic, climbs from 0.29 (low) to 0.64 (high) as code-mixing intensifies.
| Model | Low CMI | Mid CMI | High CMI |
|---|---|---|---|
| Deepgram Nova-3 Multi. | 0.273 | 0.177 | 0.199 |
| ElevenLabs Scribe v2 | 0.275 | 0.300 | 0.222 |
| Sarvam Saaras v3 | 0.287 | 0.505 | 0.638 |
| GPT-4o Transcribe | 0.428 | 0.454 | 0.488 |
| GPT-4o-mini Transcribe | 0.186 | 0.590 | 0.658 |
| AssemblyAI Univ.-3 Pro | 0.450 | 0.533 | 0.660 |
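The CMI buckets above follow the standard Code-Mixing Index of Das and Gambäck: 0 for a monolingual utterance, approaching 100 as the token mix evens out. A sketch, assuming token-level language tags are available (tag values are illustrative):

```python
def code_mixing_index(lang_tags):
    # CMI = 100 * (1 - dominant_lang_tokens / counted_tokens), where
    # tokens tagged "other" (punctuation, language-independent tokens)
    # are excluded from the count.
    counted = [t for t in lang_tags if t != "other"]
    if not counted:
        return 0.0
    dominant = max(counted.count(t) for t in set(counted))
    return 100.0 * (1 - dominant / len(counted))

# 6 Hindi tokens, 2 English tokens: CMI = 100 * (1 - 6/8) = 25.0
print(code_mixing_index(["hi"] * 6 + ["en"] * 2))
```

Bucketing then reduces to thresholding this value per utterance, so a "high CMI" utterance is one where no single language dominates the token stream.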
Switch-point F1 (token-level language-boundary prediction) and Entity Preservation (LLM-as-judge fraction of named entities preserved through the transcript) reveal additional structure. ElevenLabs Scribe v2 leads switch-point F1 at 0.42, with Deepgram Nova-3 close behind at 0.40. GPT-4o-mini Transcribe and Sarvam Saaras v3 both score 0.0 on switch-point F1 — they do not preserve mixed-language token boundaries at all. Sarvam Saaras v3 also scores 0.0 on Entity Preservation, almost certainly because the Indic-only model transliterates English entities into Indic script rather than preserving them in Latin. That behavior is fine for monolingual Indic deployments, but problematic for any pipeline that feeds the entities downstream into text-only systems.
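Switch-point F1 compares the positions where the language changes between adjacent tokens in the reference versus the hypothesis. A sketch assuming the two tag sequences are already token-aligned (real evaluation needs an alignment step first):

```python
def switch_points(tags):
    # Indices where the language label changes between adjacent tokens.
    return {i for i in range(1, len(tags)) if tags[i] != tags[i - 1]}

def switch_point_f1(ref_tags, hyp_tags):
    # F1 over the sets of switch positions in reference vs hypothesis.
    ref, hyp = switch_points(ref_tags), switch_points(hyp_tags)
    if not ref and not hyp:
        return 1.0
    true_pos = len(ref & hyp)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(hyp)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A model that transliterates everything into a single script produces no predicted switch points at all, which is exactly how a score of 0.0 arises.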
The harder finding is what is missing. Despite mining 9,000 IndicVoices conversational utterances with aggressive transliterated-English detection, we found only a few hundred Tamil-English and Bengali-English code-switch samples — almost all in the low-CMI bucket. Real-world Tanglish and Madras Bashai routinely sustain mid- and high-CMI mixing; the gap is in corpus curation, not in speaker behavior. We report this as a corpus-level scarcity that propagates into every model's code-switch coverage on Dravidian-English pairs.
Discussion
BharatVoice-Bench's headline result is that the strongest system on Indic speech is dedicated ASR — ElevenLabs Scribe v2 — and that the strongest Indic specialist (Sarvam Saaras v3) is statistically tied with the strongest non-Indic dedicated ASR (Deepgram Nova-3) on aggregate WER. OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER on Indic. The narrative that frontier multimodal models subsume specialized ASR does not hold on Indian languages.
The methodological contribution we care most about is Script Fidelity Rate. WER alone is not enough to measure Indic transcription quality: it conflates word-level errors with silent script collapse, and a romanized Latin transcript of Bengali audio is unusable downstream regardless of the WER number it reports. Vendors should publish SFR alongside WER. Customers should ask for it. We are releasing the benchmark, the per-(model, language) coverage matrix, the IndicNLP normalization, and the LLM-judge prompts so that other labs can reproduce and extend these numbers.