Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography

By November 17, 2016,
Page 917-922
Author Yasuo Morita, Takahiro Nakamura, Hiroshi Aizawa, June Tateno, Yukio Tono
Title Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography
Abstract An extensive, balanced corpus of the Japanese language comparable in size and structure to the British National Corpus (BNC) has so far not been compiled. Consequently, the practice of compiling general-purpose monolingual dictionaries on the basis of corpora has not yet been fully established in our country. Shogakukan, one of the leading publishers in Japan, has launched a project to compile a balanced Japanese-language corpus in order to produce new monolingual dictionaries of contemporary Japanese. Shogakukan has a long history and has been producing over 2,000 titles a year, ranging from general fiction, non-fiction books, magazines, and comics to dictionaries. These in-house materials are of sufficient variety and quantity to give us a head start in producing a large-scale, balanced corpus of the Japanese language, the Shogakukan Contemporary Japanese Corpus, with the collaboration of academic research institutions and other publishing companies. In this paper we will focus on the following two points: (1) how to define the concept of balance in terms of genre or subcorpora proportions by using breakdown statistics for publications in the domestic market in Japan; (2) how to estimate the optimal corpus size (i.e., the minimal amount of corpus data that can still adequately represent all the different linguistic behaviors of the candidate dictionary entries).
Session Poster Session
Keywords
BibTex
@InProceedings{ELX04-102,
author = {Yasuo Morita and Takahiro Nakamura and Hiroshi Aizawa and June Tateno and Yukio Tono},
title = {Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography},
pages = {917--922},
booktitle = {Proceedings of the 11th EURALEX International Congress},
year = {2004},
month = {july},
date = {6-10},
address = {Lorient, France},
editor = {Geoffrey Williams and Sandra Vessier},
publisher = {Université de Bretagne-Sud, Faculté des lettres et des sciences humaines},
isbn = {29-52245-70-3},
}