Abstract |
A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example, to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, for measuring both corpus similarity and corpus homogeneity. |
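As an illustration only, a chi-square comparison of two word frequency lists might be sketched as follows. The abstract does not give the formula, so this sketch assumes the standard chi-square statistic computed over the most frequent words of the two corpora combined; the function name `chi_square_distance` and the `top_n` cut-off are hypothetical and not taken from the paper.

```python
from collections import Counter


def chi_square_distance(freq_a, freq_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Assumption: this is the textbook chi-square formula, not necessarily
    the exact procedure used in the paper.
    """
    freq_a, freq_b = Counter(freq_a), Counter(freq_b)
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both frequency lists must be non-empty")

    joint = freq_a + freq_b  # combined word counts
    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, total in ((freq_a[word], total_a), (freq_b[word], total_b)):
            # Expected count if the word were spread in proportion to corpus size.
            expected = joint_count * total / (total_a + total_b)
            chi2 += (observed - expected) ** 2 / expected
    return chi2


corpus_a = Counter("the cat sat on the mat".split())
corpus_b = Counter("the dog sat on the log".split())
print(chi_square_distance(corpus_a, corpus_b))
```

Lower values indicate more similar frequency profiles. Under the same assumption, homogeneity could be estimated by applying the function to two halves of a single corpus, which gives a baseline against which a between-corpus score can be read.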
BibTex |
@InProceedings{ELX96_1-016,
  author    = {Adam Kilgarriff and Raphael Salkie},
  title     = {Corpus Similarity and Homogeneity via Word Frequency},
  pages     = {117--127},
  booktitle = {Proceedings of the 7th EURALEX International Congress},
  year      = {1996},
  month     = aug,
  date      = {13-18},
  address   = {Göteborg, Sweden},
  editor    = {Martin Gellerstam and Jerker Järborg and Sven-Göran Malmgren and Kerstin Norén and Lena Rogström and Catalina Röjder Papmehl},
  publisher = {Novum Grafiska AB},
  isbn      = {91-87850-14-1},
} |