Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography

Yasuo Morita; Takahiro Nakamura; Hiroshi Aizawa; June Tateno; Yukio Tono


Posted by admyn on November 17, 2016 in Euralex 2004, Publications

Page	917-922
Author	Yasuo Morita, Takahiro Nakamura, Hiroshi Aizawa, June Tateno, Yukio Tono
Title	Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography
Abstract	An extensive, balanced corpus of the Japanese language comparable in size and structure to the British National Corpus (BNC) has so far not been compiled. Consequently, the practice of compiling general-purpose monolingual dictionaries on the basis of corpora has not yet been fully established in our country. Shogakukan, one of the leading publishers in Japan, has launched a project to compile a balanced Japanese-language corpus in order to produce new monolingual dictionaries of contemporary Japanese. Shogakukan has a long history and has been producing over 2,000 titles a year, ranging from general fiction, non-fiction books, magazines, and comics to dictionaries. These in-house materials are of sufficient variety and quantity to give us a head start in producing a large-scale, balanced corpus of the Japanese language, the Shogakukan Contemporary Japanese Corpus, with the collaboration of academic research institutions and other publishing companies. In this paper we will focus on the following two points: (1) how to define the concept of balance in terms of genre or subcorpora proportions by using breakdown statistics for publications in the domestic market in Japan; (2) how to estimate the optimal corpus size (i.e., the minimal amount of corpus data that can still adequately represent all the different linguistic behaviors of the candidate dictionary entries).
Session	Poster Session
Keywords
BibTex	@InProceedings{ELX04-102, author = {Yasuo Morita and Takahiro Nakamura and Hiroshi Aizawa and June Tateno and Yukio Tono}, title = {Compiling a balanced corpus of modern Japanese: design issues and implications for Japanese lexicography}, pages = {917-922}, booktitle = {Proceedings of the 11th EURALEX International Congress}, year = {2004}, month = {july}, date = {6-10}, address = {Lorient, France}, editor = {Geoffrey Williams and Sandra Vessier}, publisher = {Université de Bretagne-Sud, Faculté des lettres et des sciences humaines}, isbn = {29-52245-70-3}, }
