Research

Where frontier models break,
and what we are doing about it.

Every paper measures a specific failure pattern in a specific domain. We publish the full methodology, the dataset, and the statistical significance tests alongside every claim.

Research Notes

Benchmarks and findings.

Paper № 05 · Multilingual

Frontier voice AI on Indian languages

ElevenLabs Scribe v2 leads at WER 0.277 across 10 Indic languages. Script Fidelity Rate catches a failure invisible to WER: Whisper large-v3 transliterates 83% of Odia audio into Latin script, and AssemblyAI Universal-3 Pro collapses five Indic languages to romanized output. The strongest Indic specialist (Sarvam Saaras v3) is statistically tied with Deepgram Nova-3 on aggregate WER, and OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER. Frontier multimodal models do not yet subsume specialized ASR on Indian languages.

April 16, 2026 · 12 MIN READ

Read paper
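A script-fidelity check of this kind can be approximated from Unicode block membership. The paper's exact Script Fidelity Rate definition may differ; this hypothetical sketch scores the fraction of alphabetic characters that fall inside the expected script's Unicode block, using Odia as the example:

```python
# Hypothetical sketch of a script-fidelity check; the paper's exact
# Script Fidelity Rate definition may differ. Scores the fraction of
# alphabetic characters inside the expected Unicode block.
ODIA_BLOCK = range(0x0B00, 0x0B80)  # standard Unicode block for Odia

def script_fidelity(text: str, block: range = ODIA_BLOCK) -> float:
    # Combining vowel signs and punctuation are skipped by isalpha().
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(c) in block for c in letters) / len(letters)
```

A Latin-transliterated Odia transcript scores near 0.0 under this check even when its WER against a romanized reference looks acceptable, which is exactly the failure mode WER cannot see.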

Paper № 04 · Safety

Cross-modal consistency verification in video models

Three frontier models from three vendors tie at ~87% overall contradiction detection (Haiku 0.870, Flash 0.869, GPT-4o 0.864). No model exceeds 0.74 on L6 omission. Every model loses 24–60 pp of accuracy under an "already-verified" preamble; no model is sycophancy-immune. The Gemini family refuses fewer than 70% of hallucination probes, while Claude and GPT-4o refuse ≥88%.

April 22, 2026 · 14 MIN READ

Read paper

Paper № 03 · Video

A five-axis benchmark for procedural video understanding

A paired 1-frame vs 8-frame ablation finds no benefit from multi-frame context on temporal or causal reasoning. Claude Sonnet 4.5 wins the composite at 0.446; on adversarial error detection Claude models catch 85–97% of procedural errors vs 38–76% for GPT-4o.

March 24, 2026 · 13 MIN READ

Read paper

Paper № 02 · Fairness

The language equity gap in multilingual speech AI

Three of five dedicated ASR providers cannot reliably serve the low-resource tier. Deepgram Nova-2 and Gemini 2.5 Flash silently return empty transcripts on 49.3% and 28.7%, respectively, of Mandarin–English code-switched utterances. ElevenLabs Scribe is the only provider with 100% coverage across all 20 languages on cultural-QA audio.

March 10, 2026 · 12 MIN READ

Read paper
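Silent empties are easy to measure once you look for them. A minimal sketch of the kind of coverage check involved (the paper's exact accounting may differ):

```python
def empty_rate(transcripts: list[str]) -> float:
    """Fraction of responses that are empty or whitespace-only transcripts,
    a failure mode that WER averages hide if empties are dropped before
    scoring rather than counted as total deletions."""
    if not transcripts:
        return 0.0
    return sum(1 for t in transcripts if not t.strip()) / len(transcripts)
```

Reporting this rate alongside WER is what separates "the provider is accurate" from "the provider is accurate on the utterances it chose to answer."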

Paper № 01 · Evaluation

Frontier voice models on professional speech

On transcription, ElevenLabs Scribe (0.408) and Gemini 2.5 Pro (0.411) tie at the top with overlapping CIs. On SLURP intent classification, the best text-reasoner control beats the best audio-native MLLM by 27 pp F1, and on AMI reasoning by 12 pp: audio understanding, not reasoning, is the dominant source of error.

February 24, 2026 · 11 MIN READ

Read paper
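The transcription scores above use standard word error rate: word-level edit distance (substitutions, insertions, deletions) divided by the reference word count. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via a one-row Levenshtein dynamic program
    over whitespace-tokenized words."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))  # row[j] = distance(ref[:0], hyp[:j])
    for i, rw in enumerate(ref, 1):
        prev_diag = row[0]
        row[0] = i
        for j, hw in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                # ref word unmatched
                         row[j - 1] + 1,            # hyp word unmatched
                         prev_diag + (rw != hw))    # substitution or match
            prev_diag = cur
    return row[-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, which is why coverage and script-level checks are reported alongside it rather than folded into it.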

Focus Areas

Understanding the foundations of AI capability.

The Data Wall

High-quality human text is finite: public estimates place the usable stock in the hundreds of trillions of tokens, with exhaustion risk in the late 2020s. We study how data scarcity and synthetic contamination reshape the training landscape.

Expert Reasoning Traces

The highest-value training signal isn't answers. It's the reasoning behind them. We research how professional decision traces in medicine, law, and engineering transfer to model capability.

Multi-Modal Composition

How do paired modalities like voice, vision, text, and sensor data interact during training? We investigate cross-modal data composition and the capability emergence it produces.

Agentic Data Systems

Licensed, consented data on how experts use tools barely exists. We study tool-use traces, error recovery patterns, and multi-step task completions, where small seed datasets yield outsized capability gains.

Human-Synthetic Flywheels

Models trained purely on synthetic outputs collapse. We research the optimal interplay between human-sourced data and synthetic augmentation: the flywheel that drives frontier performance.

Multilingual Representation

Frontier models still degrade sharply on underrepresented languages and culturally specific speech. We study how linguistic diversity and cultural context shape model behavior at scale.

Our Approach

Empirical, rigorous, applied.

01

Map distributional gaps

We identify where models systematically fail, from 35% accuracy on underrepresented language benchmarks to missing decision traces in professional domains like debugging, diagnosis, and legal reasoning.

02

Design data interventions

We construct targeted datasets with domain practitioners: expert reasoning traces, multi-modal paired data, and agentic task demonstrations that encode the knowledge synthetic generation cannot replicate.

03

Measure and publish

Every intervention is evaluated against domain-specific benchmarks, not generic leaderboards. We publish our findings on data composition, human-synthetic flywheels, and capability emergence.

Teams

Our research is organized around four core teams.

Data Science studies data quality for multimodal AI: measuring signal density, mapping distributional gaps, and understanding why models built on licensed expert reasoning outperform those built on volume alone.

Multi-Modal Systems investigates how models learn from heterogeneous data, studying cross-modal transfer, paired data composition, and the capability gains that multi-modal training can produce over single-modality approaches.

Agentic Intelligence researches the data infrastructure for AI agents: tool-use traces, error recovery patterns, and GUI interaction data, where 312 human demonstrations can be augmented to 27,000 training instances with 141% capability improvement.

Evaluation & Benchmarks builds domain-specific evaluation frameworks that expose failures generic benchmarks miss, measuring model performance in professional contexts across medicine, law, finance, and multilingual settings.

Collaborate

Working on something at the data frontier?

We partner with frontier labs, domain experts, and applied teams. Tell us what you’re studying.