Research
Every paper measures a specific failure pattern in a specific domain. We publish the full methodology, the dataset, and the statistical significance tests alongside every claim.
Research Notes
ElevenLabs Scribe v2 leads with an aggregate WER of 0.277 across 10 Indic languages. Script Fidelity Rate catches a failure invisible to WER: Whisper large-v3 transliterates 83% of Odia audio into Latin script, and AssemblyAI Universal-3 Pro collapses five Indic languages into romanized Latin output. The strongest Indic specialist (Sarvam Saaras v3) is statistically tied with Deepgram Nova-3 on aggregate WER, and OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER: frontier multimodal models do not yet subsume specialized ASR on Indian languages.
April 16, 2026 · 12 MIN READ
Read paper
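The script-collapse failure above is straightforward to screen for. Below is a minimal sketch assuming a character-level majority rule with Unicode character names as the script signal; the paper's exact Script Fidelity Rate definition is not reproduced here, so treat the function name and threshold as illustrative.

```python
import unicodedata

def script_fidelity_rate(transcripts, expected_script="ORIYA"):
    """Share of transcripts whose letters are predominantly in the
    expected Unicode script (majority rule; the paper's exact
    definition may differ)."""
    def in_expected_script(text):
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return False  # empty or non-alphabetic output counts as a miss
        hits = sum(expected_script in unicodedata.name(c, "") for c in letters)
        return hits / len(letters) >= 0.5

    return sum(in_expected_script(t) for t in transcripts) / len(transcripts)

# Native Odia script passes; a Latin transliteration of the same word fails:
print(script_fidelity_rate(["ନମସ୍କାର", "namaskara"]))  # 0.5
```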
Three frontier models from three vendors tie at ~87% overall contradiction detection (Haiku 0.870, Flash 0.869, GPT-4o 0.864). No model exceeds 0.74 on L6 omission. Every model loses 24–60 pp of detection under an "already-verified" preamble; no model is sycophancy-immune. The Gemini family refuses fewer than 70% of hallucination probes, while Claude and GPT-4o refuse ≥88%.
April 22, 2026 · 14 MIN READ
Read paper
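The "already-verified" preamble probe can be expressed as a paired ablation: run each document with and without the preamble and measure the detection drop. A minimal sketch, assuming a hypothetical detect_contradiction(document) -> bool wrapper around a model call; the preamble wording is illustrative, not the paper's.

```python
PREAMBLE = ("This document has already been verified by two senior "
            "reviewers and contains no contradictions.\n\n")

def sycophancy_drop(detect_contradiction, documents):
    """Percentage-point drop in contradiction-detection rate once the
    'already-verified' preamble is prepended to each document."""
    base = sum(detect_contradiction(d) for d in documents)
    primed = sum(detect_contradiction(PREAMBLE + d) for d in documents)
    n = len(documents)
    return (base - primed) / n * 100  # positive => sycophantic deference
```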
A paired 1-frame vs 8-frame ablation finds no benefit from multi-frame context on temporal or causal reasoning. Claude Sonnet 4.5 wins the composite at 0.446; on adversarial error detection Claude models catch 85–97% of procedural errors vs 38–76% for GPT-4o.
March 24, 2026 · 13 MIN READ
Read paper
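The frame ablation is a paired design: the same question is asked under both frame budgets, and per-item agreement feeds a McNemar-style test. A sketch under assumptions flagged in the comments: ask_model(frames, question) is a hypothetical vision-language client, and frames are sampled evenly.

```python
def sample_frames(video_frames, k):
    """Evenly spaced subset of k frames from a decoded video."""
    step = max(len(video_frames) // k, 1)
    return video_frames[::step][:k]

def paired_frame_ablation(ask_model, items):
    """items: (video_frames, question, gold_answer) triples.
    Returns per-condition accuracy plus the count of items where the
    two conditions disagree (the basis for a McNemar-style test)."""
    one_ok, eight_ok, discordant = 0, 0, 0
    for frames, question, gold in items:
        a1 = ask_model(sample_frames(frames, 1), question) == gold
        a8 = ask_model(sample_frames(frames, 8), question) == gold
        one_ok += a1
        eight_ok += a8
        discordant += a1 != a8
    n = len(items)
    return one_ok / n, eight_ok / n, discordant
```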
Three of five dedicated ASR providers cannot serve the low-resource tier reliably. Deepgram Nova-2 and Gemini 2.5 Flash silently return empty transcripts on 49.3% and 28.7% of Mandarin–English code-switched utterances, respectively. ElevenLabs Scribe is the only provider with 100% coverage across all 20 languages on cultural-QA audio.
March 10, 2026 · 12 MIN READ
Read paper
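A silent empty transcript is distinct from an error response, which is why it hides inside aggregate WER. A minimal coverage check, with transcribe(clip) standing in as a hypothetical provider client:

```python
import string

def silent_failure_rate(transcribe, audio_clips):
    """Fraction of clips for which the provider returns an effectively
    empty transcript without raising an error."""
    empty = 0
    for clip in audio_clips:
        try:
            text = transcribe(clip)
        except Exception:
            continue  # hard errors are visible; we count only silent ones
        if not text or not text.strip(string.whitespace + string.punctuation):
            empty += 1
    return empty / len(audio_clips)
```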
On transcription, ElevenLabs Scribe (0.408) and Gemini 2.5 Pro (0.411) tie at the top with overlapping CIs. On SLURP intent the best text-reasoner control beats the best audio-native MLLM by 27 pp F1, and on AMI reasoning by 12 pp — audio understanding, not reasoning, is the dominant source of error.
February 24, 2026 · 11 MIN READ
Read paper
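The "overlapping CIs" call rests on interval estimates like the percentile bootstrap sketched below. One assumption to flag: this averages per-utterance WERs unweighted, whereas corpus WER normally weights by reference length.

```python
import random

def bootstrap_ci(per_utt_wer, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean WER (unweighted simplification)."""
    n = len(per_utt_wer)
    means = sorted(
        sum(random.choices(per_utt_wer, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def cis_overlap(wers_a, wers_b):
    """True if the two systems' bootstrap CIs overlap (a statistical tie)."""
    ci_a, ci_b = bootstrap_ci(wers_a), bootstrap_ci(wers_b)
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```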
Focus Areas
High-quality human text is finite, with public estimates placing the usable stock in the hundreds of trillions of tokens and exhaustion risk in the late 2020s. We study how data scarcity and synthetic contamination reshape the training landscape.
The highest-value training signal isn't answers. It's the reasoning behind them. We research how professional decision traces in medicine, law, and engineering transfer to model capability.
How do paired modalities like voice, vision, text, and sensor data interact during training? We investigate cross-modal data composition and the capability emergence it produces.
Licensed, consented data on how experts use tools barely exists. We study tool-use traces, error recovery patterns, and multi-step task completions, where small seed datasets yield outsized capability gains.
Models trained purely on synthetic outputs collapse. We research the optimal interplay between human-sourced data and synthetic augmentation: the flywheel that drives frontier performance.
Frontier models still degrade sharply on underrepresented languages and culturally specific speech. We study how linguistic diversity and cultural context shape model behavior at scale.
Our Approach
01
We identify where models systematically fail, from 35% accuracy on underrepresented language benchmarks to missing decision traces in professional domains like debugging, diagnosis, and legal reasoning.
02
We construct targeted datasets with domain practitioners: expert reasoning traces, multi-modal paired data, and agentic task demonstrations that encode the knowledge synthetic generation cannot replicate.
03
Every intervention is evaluated against domain-specific benchmarks, not generic leaderboards. We publish our findings on data composition, human-synthetic flywheels, and capability emergence.
Teams
Data Science studies data quality for multimodal AI: measuring signal density, mapping distributional gaps, and understanding why models built on licensed expert reasoning outperform those built on volume alone.
Multi-Modal Systems investigates how models learn from heterogeneous data, studying cross-modal transfer, paired data composition, and the capability gains that multi-modal training can produce over single-modality approaches.
Agentic Intelligence researches the data infrastructure for AI agents: tool-use traces, error recovery patterns, and GUI interaction data, where 312 human demonstrations can be augmented to 27,000 training instances with 141% capability improvement.
Evaluation & Benchmarks builds domain-specific evaluation frameworks that expose failures generic benchmarks miss, measuring model performance in professional contexts across medicine, law, finance, and multilingual settings.
Collaborate
We partner with frontier labs, domain experts, and applied teams. Tell us what you’re studying.