Automatic machine translation (MT) metrics are widely used to compare the quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics are reliable at distinguishing good translations from bad ones at the sentence level (segment-level evaluation). We first investigate how useful MT metrics are at detecting segment-level quality by correlating metric scores with translation utility for downstream tasks: they are not! In the second part, we take a more holistic view of this investigation by evaluating 22 MT metrics on a contrastive challenge set consisting of 68 phenomena, ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We find that metrics tend to disregard source sentences and fail on ambiguous translations. Further, even neural metrics are considerably influenced by word overlap with the reference, and some properties of multilingual embeddings produce undesirable effects for MT evaluation.
Nikita Moghe is a final-year PhD student at the University of Edinburgh working on dialogue systems, multilinguality, and, accidentally, machine translation evaluation. When she is not at her desk, she is probably reassuring her Sitar teacher that the AI doomsday is not here (…yet).