Corpus Similarity and Homogeneity via Word Frequency

By November 17, 2016,
Page 121-130
Author Adam Kilgarriff, Raphael Salkie
Title Corpus Similarity and Homogeneity via Word Frequency
Abstract A measure of corpus similarity would be very useful for lexicography. Word frequency lists are cheap and easy to generate, so a measure based on them can be used where a detailed comparison of the two corpora is not viable, for example, to judge how a new corpus relates to already familiar ones. We show that corpus similarity can only be interpreted in the light of corpus homogeneity, and present a measure, based on the chi-square statistic, for measuring both corpus similarity and corpus homogeneity.
Session PART 1 - Computational Lexicology and Lexicography
author = {Adam Kilgarriff and Raphael Salkie},
title = {Corpus Similarity and Homogeneity via Word Frequency},
pages = {117-127},
booktitle = {Proceedings of the 7th EURALEX International Congress},
year = {1996},
month = aug,
date = {13-18},
address = {Göteborg, Sweden},
editor = {Martin Gellerstam and Jerker Järborg and Sven-Göran Malmgren and Kerstin Norén and Lena Rogström and Catalina Röjder Papmehl},
publisher = {Novum Grafiska AB},
isbn = {91-87850-14-1},