Online platforms are under increasing pressure to moderate or remove hate speech and toxic language, and are forced to rely on automatic filters as a first pass due to the sheer volume of content they have to deal with. Unfortunately, as I’ll discuss in this talk, automated methods can backfire by learning spurious correlations between offensiveness and lexical cues (e.g., mentions of minority identities, slang, or dialect markers), partially explained by annotator biases. I’ll walk through possible data filtering solutions and discuss their effectiveness at mitigating these biases in toxic language detection systems (spoiler alert: it’s not great). I’ll finish with a possible alternative to hate speech labelling, and open up the discussion to the ethical conundrum of automating toxic language detection vs. tasking humans with removing this content (this should be fun 🙃).
Maarten Sap is (almost) a 6th- and final-year PhD student in NLP at the University of Washington, working with Noah Smith and Yejin Choi. His work tackles social commonsense reasoning as well as social good and social justice applications of NLP.