I am a member of technical staff at Cohere, working on model pretraining. Day to day, I think about tokenization, trustworthy evals, signal at small scales, model optimization, and data infrastructure. In the past, I have worked on topics such as (Unicode) character-based language modelling, deep generative models applied to language, and structured meaning representations.

Selected publications

(see my Google Scholar for a full list)

Tokenization and tokenization-free modelling

General language modelling

Miscellaneous