Crowdsourcing Pedagogical Corpora for Lexicographical Purposes

Page 771-779
Author Tanara Zingano Kuhn, Branislava Šandrih Todorović, Špela Arhar Holdt, Rina Zviel-Girshin, Kristina Koppel, Ana R. Luís, Iztok Kosem
Title Crowdsourcing Pedagogical Corpora for Lexicographical Purposes
Abstract Corpora are valuable sources for the development of language learning materials (e.g., books, grammars, dictionaries, exercises) because they contain language as produced in natural contexts. Even though corpora are getting larger, mainly due to crawling data from the web, their pedagogical use remains rather challenging. Not all texts are appropriate for language learning or teaching purposes, as they can contain sensitive or offensive content, in addition to exhibiting structural problems and errors, among other issues. Manual corpus cleaning for pedagogical purposes is, however, very time-consuming. In this paper we present a new and more effective method for creating problem-labelled pedagogical corpora for a group of languages, namely Portuguese, Serbian, Slovene, Dutch and Estonian, by means of crowdsourcing. First, we report on an experiment aimed at verifying the adequacy of crowdsourcing as a technique for corpus labelling. We then outline the lessons learned and discuss how these have led us to explore an alternative way of compiling pedagogical corpora through gamification.
Session Lexicography and Corpus Linguistics; Lexicography and Language Tech
Keywords corpus creation; good example sentences; pedagogical corpora; crowdsourcing
