RIDIRE. Corpus and Tools for the Acquisition of Italian L2

November 17, 2016
Pages 447-462
Authors Alessandro Panunzi, Emanuela Cresti, Lorenzo Gregori
Title RIDIRE. Corpus and Tools for the Acquisition of Italian L2
Abstract This paper introduces the RIDIRE corpus, built with an open source tool (RIDIRE-CPI) for creating purpose-designed web corpora through a targeted crawling strategy. The RIDIRE-CPI architecture combines existing open source tools with specifically developed modules: a robust crawler, a user-friendly web interface, several conversion and cleaning tools, an anti-duplicate filter, a language guesser, and a PoS-tagger. The RIDIRE corpus is a balanced Italian web corpus (1.5 billion tokens) designed to support the study of Italian as a second language, and it can also be used for lexicographic purposes. The targeted crawling was performed through content selection, metadata assignment, and validation procedures. These features allowed the construction of a large corpus with a specific design, covering a variety of language usage domains (News, Business, Administration and Legislation, Literature, Fiction, Design, Cookery, Sport, Tourism, Religion, Fine Arts, Cinema, Music). The RIDIRE query system allows research to be carried out on the whole corpus or on its sub-corpora. Specifically, the available queries cover all the functions commonly used in corpus-based lexicography: frequency lists, concordances and patterns, collocations, Sketches, and Sketch Differences.
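Example As a rough illustration of two of the query types named in the abstract (frequency lists and concordances), the following minimal Python sketch may help. It is not the RIDIRE query system or any part of RIDIRE-CPI; the function names (tokenize, frequency_list, concordance), the context width, and the sample sentence are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Lowercased word tokens. In Python 3, \w is Unicode-aware,
    # so accented Italian letters such as "è" and "ù" are kept.
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, top=None):
    # (token, count) pairs, most frequent first.
    return Counter(tokens).most_common(top)


def concordance(tokens, keyword, width=2):
    # Keyword-in-context hits: (left context, keyword, right context),
    # with up to `width` tokens on each side.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text, not taken from the RIDIRE corpus.
sample = "La pasta al forno è buona. La pasta fresca è più buona."
tokens = tokenize(sample)

for token, count in frequency_list(tokens, top=4):
    print(token, count)

for left, kw, right in concordance(tokens, "pasta"):
    print(f"{left:>10} [{kw}] {right}")
```

A real corpus query system would run such lookups against an index over PoS-tagged text and could restrict them to a sub-corpus; this sketch works on a plain in-memory token list only.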
Session Lexicography and Corpus Linguistics
Keywords Corpus linguistics; Terminology; Collocations
author={Alessandro Panunzi and Emanuela Cresti and Lorenzo Gregori},
title={RIDIRE. Corpus and Tools for the Acquisition of Italian L2},
booktitle={Proceedings of the 16th EURALEX International Congress},
address={Bolzano, Italy},
editor={Abel, Andrea and Vettori, Chiara and Ralli, Natascia},
publisher={EURAC research},