I am a member of technical staff at Cohere, working on model pretraining. Day to day, I think about tokenization, trustworthy evals, signal at small scales, model optimization, and data infrastructure. In the past, I have worked on topics such as (Unicode) character-based language modelling, deep generative models applied to language, and structured meaning representations.

Selected publications

(see my Google Scholar for a full list)

Tokenization and tokenization-free modelling

General language modelling

Miscellaneous