Lexicon Based Critical Tokenisation: An Algorithm

Pages 213-220
Author Jon Mills
Title Lexicon Based Critical Tokenisation: An Algorithm
Abstract In some languages, spaces and punctuation marks are used to delimit word boundaries. This is the case with Cornish. However, there is considerable inconsistency of segmentation to be found within the Corpus of Cornish. The individual texts that make up this corpus are not even internally consistent. The first stage in lemmatising the Corpus of Cornish, therefore, involves the resegmentation of the corpus into tokens. The whole notion of what is considered to be a word has to be examined. A method for the logical representation of segmentation into tokens is proposed in this paper. The existing segmentation of the Corpus of Cornish, as indicated by spaces in the text, is abandoned, and an algorithm for dictionary-based critical tokenisation of the corpus is proposed.
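The approach the abstract describes, discarding the original spacing and resegmenting the text against a lexicon, can be illustrated with a minimal sketch. The function below, a hypothetical illustration rather than the paper's actual algorithm, enumerates every segmentation of a despaced string into tokens drawn from a lexicon; the toy lexicon entries are invented for the example.

```python
def segmentations(text, lexicon, _memo=None):
    """Enumerate every way to split `text` into tokens from `lexicon`.

    The original spacing is ignored: the caller passes the text with
    spaces removed, and every lexicon-compatible segmentation is
    generated, from which a critical tokenisation can later be chosen.
    """
    if _memo is None:
        _memo = {}
    if text in _memo:
        return _memo[text]
    if not text:
        return [[]]  # the empty text has exactly one (empty) segmentation
    results = []
    # Try every prefix of the text that is a lexicon entry, then
    # recursively segment the remainder (memoised to avoid rework).
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon, _memo):
                results.append([head] + rest)
    _memo[text] = results
    return results

# Toy lexicon (invented forms for illustration only): note that the
# ambiguity between "an gwr" and "angwr" yields two candidate tokenisations.
lexicon = {"an", "gwr", "angwr", "a", "wel"}
print(segmentations("angwrawel", lexicon))
```

Because several lexicon entries can overlap, the output is a set of candidate tokenisations rather than a single answer; choosing among them is the "critical" part of the task.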
Session PART 2 - Computational Lexicology and Lexicography
Keywords lexicography, tokenisation, lemmatisation, segmentation, Cornish
@inproceedings{Mills1998,
  author = {Jon Mills},
  title = {Lexicon Based Critical Tokenisation: An Algorithm},
  pages = {213--220},
  booktitle = {Proceedings of the 8th EURALEX International Congress},
  year = {1998},
  month = {aug},
  date = {4-8},
  address = {Liège, Belgium},
  editor = {Thierry Fontenelle and Philippe Hiligsmann and Archibald Michiels and André Moulin and Siegfried Theissen},
  publisher = {Euralex},
  isbn = {2-87233-091-7},
}