Corpus Similarity and Homogeneity via Word Frequency

Author Adam Kilgarriff, Raphael Salkie
Abstract A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, of both corpus similarity and corpus homogeneity.
Session PART 1 - Computational Lexicology and Lexicography
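
Illustration The abstract does not spell out how the measure is computed, so the following is only a minimal sketch of one way a chi-square comparison of two word frequency lists might look. It assumes the textbook chi-square formula applied to the most frequent words of the combined corpora. The function names `chi_square` and `homogeneity`, the `top_n` cutoff, and the choice to estimate homogeneity by comparing two halves of the same corpus are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def chi_square(freq_a, freq_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint corpus.

    freq_a and freq_b are Counters mapping word -> count (hypothetical inputs).
    A lower value suggests the two frequency lists are more alike.
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")
    total = size_a + size_b
    joint = freq_a + freq_b  # Counter addition sums the counts per word

    stat = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            # Expected count if the word were spread in proportion to corpus size.
            expected = joint_count * size / total
            stat += (observed - expected) ** 2 / expected
    return stat


def homogeneity(documents, top_n=500):
    """Assumed homogeneity estimate: compare two halves of a single corpus.

    documents is a list of token lists; even- and odd-indexed documents
    form the two halves.
    """
    half_a = Counter(tok for doc in documents[0::2] for tok in doc)
    half_b = Counter(tok for doc in documents[1::2] for tok in doc)
    return chi_square(half_a, half_b, top_n)
```

Under these assumptions, a similarity score between two corpora would be read against each corpus's own within-corpus score, in line with the abstract's point that similarity can only be interpreted in the light of homogeneity.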
