Abstract |
An extensive, balanced corpus of the Japanese language comparable in size and structure to the British National Corpus (BNC) has so far not been compiled. Consequently, the practice of compiling general-purpose monolingual dictionaries on the basis of corpora has not yet been fully established in our country. Shogakukan, one of the leading publishers in Japan, has launched a project to compile a balanced Japanese-language corpus in order to produce new monolingual dictionaries of contemporary Japanese. Shogakukan has a long history and has been producing over 2,000 titles a year, ranging from general fiction, non-fiction books, magazines, and comics to dictionaries. These in-house materials are of sufficient variety and quantity to give us a head start in producing a large-scale, balanced corpus of the Japanese language, the Shogakukan Contemporary Japanese Corpus, with the collaboration of academic research institutions and other publishing companies. In this paper we will focus on the following two points: (1) how to define the concept of balance in terms of genre or subcorpora proportions by using breakdown statistics for publications in the domestic market in Japan; (2) how to estimate the optimal corpus size (i.e., the minimal amount of corpus data that can still adequately represent all the different linguistic behaviors of the candidate dictionary entries). |
BibTeX |
@InProceedings{ELX04-102, author = {Yasuo Morita and Takahiro Nakamura and Hiroshi Aizawa and June Tateno and Yukio Tono}, title = {Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography}, pages = {917--922}, booktitle = {Proceedings of the 11th EURALEX International Congress}, year = {2004}, month = {july}, date = {6-10}, address = {Lorient, France}, editor = {Geoffrey Williams and Sandra Vessier}, publisher = {Université de Bretagne-Sud, Faculté des lettres et des sciences humaines}, isbn = {29-52245-70-3}, } |