Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
A graph is worth a thousand numbers.
Automatic metrics are a fundamental part of Machine Translation research, serving as a faster and cheaper proxy for human evaluation. They are typically assessed by their Pearson correlation with human scores. However, Pearson correlation can mask different patterns in the data. In the first part of this talk, we investigate two such patterns in WMT metrics task data, outliers and heteroskedasticity, and answer two questions: (a) how much do outliers influence the correlation of metrics, and (b) how does MT system quality influence metric reliability?
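As a minimal illustration of the outlier effect (with entirely made-up scores, not WMT data): a metric whose scores barely track human judgement among a cluster of similar systems can look strongly correlated once a single much-worse outlier system enters the sample.

```python
import numpy as np

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm * xm).sum() * (ym * ym).sum()))

# Hypothetical human and metric scores for four closely matched MT systems:
# within the cluster, the metric is essentially uninformative.
human  = [0.50, 0.52, 0.55, 0.53]
metric = [0.30, 0.28, 0.29, 0.31]
print(round(pearson(human, metric), 2))  # ≈ -0.12

# Adding one clear outlier system (much worse on both axes)
# drags the correlation up dramatically.
human_o  = human  + [0.10]
metric_o = metric + [0.05]
print(round(pearson(human_o, metric_o), 2))  # ≈ 0.99
```

The high correlation in the second case says almost nothing about how well the metric ranks the competitive systems, which is usually the comparison researchers care about.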
We then shift focus to using these metrics to evaluate MT systems. In academia, metrics are often used to draw conclusions about two MT systems based on differences in BLEU scores. If System A improves on System B by ‘x’ BLEU points, how would humans judge the two systems? And how do other metrics compare to BLEU?
I’ll finish with recommendations for an ideal world, and then open up the discussion.
Nitika Mathur is a PhD candidate in the NLP group at the University of Melbourne, working with Trevor Cohn and Tim Baldwin on machine translation evaluation. Tim once said that her epitaph should be ‘liked making graphs.’