Pratik Joshi and Sebastin Santy

The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Through the lens of our taxonomy, you can see mountains of technical progress, but also thousands of languages on the ground without a way to climb them.

Language technologies have the potential to promote multilingualism and linguistic diversity, and can bring down communication and technological barriers around the world. However, only a handful of the world's more than 7,000 languages are represented in these technologies and applications. In this talk, we discuss our quantitative analyses, which centre on the disparity between languages in terms of language resources and research. We define a taxonomy based on two simple dimensions to highlight this disparity, and then go into subsequent findings on typological representation in resources, language representation at NLP conferences, and the trajectories that different languages have followed across successive iterations of those conferences. We then examine how the taxonomy is reflected throughout these analyses, and share some interesting takeaways from our study.

Finally, we’ll wrap up with some ways we think this disparity can be reduced, and briefly go over some other projects where we either encountered or attempted to tackle the issue of language technology gaps.


Pratik is currently a Research Engineer at Google Research India and was previously a Research Intern at Microsoft Research India. His work has revolved around multilingual systems and probing natural language understanding models. He is due to join CMU for his master’s next year.

Sebastin is an AI Center Fellow at Microsoft Research India. His work includes building tools and interfaces to increase access to language technologies. He has also worked on analyses of the challenges facing low-resource languages.

Presentation Materials

ACL2020 Paper