Abstract |
In this paper, we describe Lexicon Creator, a tool designed to help developers produce lexical data for use in a variety of linguistic applications such as spell-checkers, wordbreakers, thesauri, etc. The tool enables developers to work on existing wordlists derived either directly from corpora or from previously created wordlist data. The key feature of the tool is that it enables linguists to rapidly create the morphological rules that are necessary to generate all the inflected forms of a given item. In many languages, a given word may have many forms, each distinguished by different endings attached to the stem of the word. A language like English is rather simple morphologically (the verb walk has only the forms walk, walks, walked, and walking), while other languages may have many more forms for a word. Yet it is essential to create lexicons that can recognize and generate all the inflected forms of a given word, especially for applications such as spell-checkers (where overgeneration should be avoided), thesauri, grammar checkers, morphological analyzers/generators, speech recognizers, and handwriting recognizers. It would be extremely time-consuming to code each of these forms individually, so it is necessary to develop this data more efficiently. Lexicon Creator allows linguists to classify these variations of the same word into templates, or morphological classes, which allow the automatic generation of all valid forms of a word. Once the templates describing these variations have been defined, the data-coding task consists of assigning an input word to the correct template and checking that the automatically generated forms are valid. The article also covers the additional types of linguistic information that can be attached to words, depending on the intended application that will use the resulting full-form lexicon. |
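The template idea described in the abstract can be illustrated with a minimal sketch. This is not Lexicon Creator's actual rule format or API, which the abstract does not specify; the `Template` class, the `REGULAR_VERB` template, and the `build_full_form_lexicon` helper are hypothetical names, and the sketch assumes the simplest case, in which every form is the stem plus a suffix.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical morphological class: a name plus the suffixes it attaches to a stem."""

    name: str
    suffixes: tuple[str, ...]

    def generate(self, stem: str) -> list[str]:
        # Assumption: purely concatenative morphology (stem + suffix),
        # with no stem alternations or spelling adjustments.
        return [stem + suffix for suffix in self.suffixes]


# Hypothetical template covering the English example from the abstract.
REGULAR_VERB = Template("regular_verb", ("", "s", "ed", "ing"))


def build_full_form_lexicon(assignments: dict[str, Template]) -> dict[str, list[str]]:
    """Expand each stem into all forms licensed by its assigned template."""
    return {stem: template.generate(stem) for stem, template in assignments.items()}


lexicon = build_full_form_lexicon({"walk": REGULAR_VERB})
print(lexicon["walk"])  # ['walk', 'walks', 'walked', 'walking']
```

In this sketch, the data-coding task reduces to choosing the right template for each input word and reviewing the generated list, which mirrors the workflow the abstract outlines.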
BibTeX |
@InProceedings{ELX08-019, author = {Thierry Fontenelle and Nick Cipollone and Mike Daniels and Ian Johnson}, title = {Lexicon Creator: A Tool for Building Lexicons for Proofing Tools and Search Technologies}, pages = {359--369}, booktitle = {Proceedings of the 13th EURALEX International Congress}, year = {2008}, month = jul, date = {15-19}, address = {Barcelona, Spain}, editor = {Elisenda Bernal and Janet DeCesaris}, publisher = {Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra}, isbn = {978-84-96742-67-3}, } |