Sabrina J. Mielke

Fair comparisons for generative language models—with a bit of Information Theory

A story of anger-driven development: yes, you can compare perplexities, no, not like that.

One morning I awoke from uneasy dreams to find a paper that claimed it couldn’t compare to previous work because its segmentation differed, and in anger I started writing a blog post about proper comparisons.

The bulk of this talk will be dedicated to two papers that try to do exactly that: fairly compare the performance of generative models using probabilistic, information-theoretic measures. The first paper details how to evaluate (monolingual) open-vocabulary language models by total bits, and the second ponders the meaning of “information” and how to use it to compare machine translation models. In both cases we get only a small glimpse of what might make languages easier or harder for models, but, deviating from the polished conference talk, I will recount how I spent half a year on a super-fancy model that yielded essentially the same conclusions as a simple averaging step…
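To make the “total bits” idea concrete, here is a minimal Python sketch with made-up, purely illustrative per-token probabilities (not numbers from the papers): per-token perplexities of models with different segmentations are not directly comparable, but the total bits assigned to the same test text, or bits per character, live on a shared scale.

    import math

    # Hypothetical probabilities two models assign to the SAME test text,
    # segmented differently (numbers are invented for illustration only).
    subword_probs = [0.20, 0.10, 0.30, 0.25]  # 4 subword tokens
    char_probs    = [0.5] * 16                # 16 characters

    def total_bits(probs):
        """Total surprisal of the test text: -log2 P(text)."""
        return -sum(math.log2(p) for p in probs)

    def perplexity(probs):
        """Per-token perplexity: 2 ** (total bits / number of tokens)."""
        return 2 ** (total_bits(probs) / len(probs))

    n_chars = 16  # length of the test text in characters (shared normalizer)

    for name, probs in [("subword model", subword_probs),
                        ("character model", char_probs)]:
        print(name,
              "| perplexity:", round(perplexity(probs), 2),               # depends on segmentation
              "| total bits:", round(total_bits(probs), 2),               # comparable across models
              "| bits/char:", round(total_bits(probs) / n_chars, 3))      # comparable across models

Since both models score the same text, their total bits (and bits per character) can be compared directly, whereas their perplexities are normalized by different token counts and so cannot.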

The rest of the talk will be dedicated to a short overview of two other papers I don’t hate enough yet: one actually building new open-vocabulary language models, the other using such models to evaluate and mitigate gender bias in morphologically rich languages. I’ll lead into the discussion by teasing some other stuff, including current work-in-progress on metacognition.


Sabrina J. Mielke is a PhD student at Johns Hopkins University, working on things like open-vocabulary language modeling and fair cross-linguistic evaluation. She has a dark formal language theory past and patiently waits for the revival of tree transducers in NLP. Her favorite part of the job until then? Making figures and slides.

Presentation Materials

Talk Video
Slides