Crowdsourcing Beyond Annotation: Case Studies in Benchmark Data Collection

An EMNLP 2021 tutorial by Alane Suhr, Clara Vania, Nikita Nangia, Maarten Sap, Mark Yatskar, Sam Bowman, and Yoav Artzi.

The slides are available here. Our pre-tutorial teaser video is available here.

Crowdsourcing from non-experts is one of the most common approaches to collecting data and annotations in NLP. It has been applied to a plethora of tasks, including question answering, instruction following, visual reasoning, and commonsense reasoning. Yet, despite being such a fundamental tool, the use of crowdsourcing is largely guided by common practices and the personal experience of researchers. Developing a theory of crowdsourcing use for practical language problems remains an open challenge. However, various principles and practices have proven effective in generating high-quality and diverse data. The goal of this tutorial is to expose NLP researchers to such crowdsourcing methods and principles for data collection through a detailed discussion of a diverse set of case studies.

The video embedded below is a playlist. Click the menu icon in the top-right corner to open the playlist pane and navigate between sections.