Romanization with Friends: Deciphering Informally Romanized Text
One model’s noise is another model’s data
Informal romanization is an idiosyncratic way of typing non-Latin-script languages in the Latin alphabet, common on social media and in other online communication. Although each user has their own character substitution preferences, these choices are typically grounded in shared perceptions of visual and phonetic similarity between characters. In this talk, I will focus on the task of converting such romanized text into its native orthography for Russian, Egyptian Arabic, and Kannada, showing how a similarity-encoding inductive bias helps in the absence of parallel data. I’ll also share some insights into the behaviors of unsupervised finite-state and seq2seq models on this task and discuss how combining them can leverage their different strengths.
Maria Ryskina is a PhD student at the Language Technologies Institute at Carnegie Mellon University. Her research focuses on non-standard language, such as novel words and non-standard spellings, and on how such innovation at the individual level drives larger-scale language change.