David Adelani

Development of NLP datasets and models for African Languages

September 23, 2020 14:00 UTC

“Everyone can build a model for an African language, no one can evaluate it like Masakhane can!” – Jade Abbott

In recent years, deep learning models have been very successful on many natural language processing tasks, including machine translation, text generation, information extraction, and dialogue understanding. However, many of these models are evaluated only on English and other high-resourced languages, because those languages have large unlabelled corpora and numerous labelled datasets, both of which are absent for low-resourced African languages. The high-resourced languages number only a few dozen, are concentrated in a few regions of the world, and share many similarities, which limits how well these models generalize to low-resourced languages.

In this talk, I will discuss some of the challenges of working on low-resourced languages, including the non-availability of training data and data quality issues, as well as the development of word embeddings and labelled datasets for African languages. First, I will compare the quality of word embeddings for Twi and Yoruba trained on large online multilingual resources such as Wikipedia and CommonCrawl with word embeddings trained on small curated corpora, and analyze the noise in publicly available corpora. Second, I will discuss some techniques for addressing the lack of labelled training data, such as distant and weak supervision (e.g., rules written by native speakers) and transfer learning, with a focus on named entity recognition for Hausa and Yoruba. In conclusion, I will discuss our effort (with Masakhane NLP) to develop named entity recognition datasets for over 10 African languages.

David Ifeoluwa Adelani is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialog systems and online social interactions. He is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with special focus on African languages.

Presentation Materials

Talk Video
LREC2020 Paper