Corpus Similarity and Homogeneity via Word Frequency

Author: Adam Kilgarriff, Raphael Salkie
Title: Corpus Similarity and Homogeneity via Word Frequency
Abstract: A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example, to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, for measuring both corpus similarity and corpus homogeneity.
Session: PART 1 - Computational Lexicology and Lexicography
author = {Adam Kilgarriff and Raphael Salkie},
title = {Corpus Similarity and Homogeneity via Word Frequency},
pages = {117--127},
booktitle = {Proceedings of the 7th EURALEX International Congress},
year = {1996},
month = {aug},
date = {13-18},
address = {Göteborg, Sweden},
editor = {Martin Gellerstam and Jerker Järborg and Sven-Göran Malmgren and Kerstin Norén and Lena Rogström and Catalina Röjder Papmehl},
publisher = {Novum Grafiska AB},
isbn = {91-87850-14-1},