Summary of paper presented at Euralex '94, an international congress for lexicographers held at the Free University Amsterdam from August 30 to September 3, 1994, plus appendixes for reference.
Showa Women's University, Institute of Modern Culture
KANJI DICTIONARY PUBLISHING SOCIETY
漢英字典刊行会
1-3-502 3-Chome Niiza Niiza-shi, Saitama 352 JAPAN
PHONE: +81-48-481-3103 FAX: +81-48-479-1323

JACK HALPERN
Research Fellow at Institute of Modern Culture
Editor in Chief of Kanji Integrated Tools Project
Editor in Chief of New Japanese-English Character Dictionary

MASAAKI NOMURA
Professor of Japanese
Center for Japanese Language, Waseda University

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics
Center for Linguistic and Cultural Research, Nagoya University
The New Japanese-English Character Dictionary was designed to provide an in-depth understanding of how kanji are used in contemporary Japanese. One aim of this project is to use NJECD to build a comprehensive database with detailed information on how Chinese characters are used in Chinese, Japanese and Korean, including printed/calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms, homophones, and voluminous reference data. A second aim is to use this database to compile about forty applications and spinoff products for pedagogical and research purposes, including learner's dictionaries, reference manuals, and CALL software, by integrating lexical semantics and combinatorics with computational lexicography.
Although Japanese has been the subject of various linguistic studies, little attention has been given to the systematic analysis of its writing system. Kanji (Chinese characters as used in Japanese) are combined with each other to generate countless compound words, and function as a network of interrelated parts. Though this is vaguely recognized by educators, it has been largely disregarded in the compilation of character dictionaries. The demand for effective tools for mastering the Japanese script has been growing at an unprecedented pace. Learners are in urgent need of dictionaries that systematically address the special problems of non-Japanese students.
The New Japanese-English Character Dictionary (NJECD) (Halpern 1990, 1993) was compiled with the aim of creating a lookup tool that provides an in-depth understanding of the meanings and functions of high-frequency characters in contemporary Japanese. The dictionary departs from traditional kanji lexicography in several ways: (1) the *core meaning* defines the dominant character sense; (2) detailed meanings show how single-character morphemes generate numerous compounds; (3) psychologistic ordering reveals the logical/hierarchical interrelatedness between senses; (4) the System of Kanji Indexing by Patterns (SKIP), a new method for rapid retrieval of entries; and (5) precise distinctions between synonyms, homophones, and orthographic variants (for further details, see Halpern 1990, EURALEX '90 Proceedings).
This project aims to contribute to Sino-Japanese studies in general, and to Japanese language studies in particular, in the following four areas:
To achieve these aims, the Kanji Dictionary Publishing Society was established in late 1993 as a part of the Institute of Modern Culture at Showa Women's University. The Society is directed by the Editorial Committee, which consists of renowned experts in Japanese linguistics, and is financed by the University and various foundations (1994 budget about US$250,000).
The DESK database is being used for compiling about forty computer-edited applications and spinoff products, including teaching and learning aids such as learner's dictionaries and reference manuals, foreign-language editions such as a German edition of NJECD, software packages such as CAI/CAL courseware, electronic books and learning machines, and so on. This series of products will be referred to as KIT, which stands for Kanji Integrated Tools.
During the initial phase of the project, which will be completed in mid-1994, the framework and principal components of DESK will be created, and the electronic book (EB) edition of NJECD will be published. Concurrently, a pilot system for a pocket edition of NJECD is being built, and will also be completed in mid-1994.
The following KIT applications will be either published or finalized for publication over a period of two to three years:
The EB edition of NJECD is scheduled for publication in the summer of 1994 in time for presentation at Euralex '94. This is the first kanji-English dictionary based on CD-ROM technology. It incorporates all the features of NJECD, including core meanings, independent words, homophone/synonym discrimination, compounds, radicals, a kanji thesaurus, and much more. A hierarchical menu system enables the user to retrieve information easily by specifying single or multiple keywords, such as readings, radicals, core meanings, SKIP patterns, and stroke counts, in normal or word-end searches. This, combined with a comprehensive cross-reference network, provides the user with multiple search paths to access information with maximum speed and facility.
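The combination of several search keys with normal or word-end matching can be illustrated with a minimal Python sketch. It is not the EB edition's retrieval software: the `Entry` record, the `search` function, and its parameters are hypothetical names, and the three sample entries are a tiny illustrative subset rather than records taken from the dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, heavily simplified record for one entry character."""
    char: str
    readings: tuple  # romanized on and kun readings
    radical: str
    strokes: int


# Illustrative sample data only.
ENTRIES = [
    Entry("暖", ("dan", "atatakai"), "日", 13),
    Entry("温", ("on", "atatakai"), "水", 12),
    Entry("煖", ("dan",), "火", 13),
]


def search(entries, reading=None, word_end=False, radical=None, strokes=None):
    """Return the characters that satisfy every key that was supplied.

    With word_end=True the reading key is matched against the end of each
    reading; otherwise it must equal a reading exactly.
    """
    def reading_matches(entry):
        if reading is None:
            return True
        if word_end:
            return any(r.endswith(reading) for r in entry.readings)
        return reading in entry.readings

    return [
        e.char
        for e in entries
        if reading_matches(e)
        and (radical is None or e.radical == radical)
        and (strokes is None or e.strokes == strokes)
    ]
```

For instance, `search(ENTRIES, reading="dan", strokes=13)` combines two keys, while `search(ENTRIES, reading="kai", word_end=True)` performs a word-end search on the reading alone.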
The principal semantic component of DESK was compiled by submitting single-character morphemes to an exhaustive semantic analysis. The meanings were analyzed by such techniques as componential analysis and an in-depth examination of the differences and similarities between near-synonyms, which served as a powerful technique for establishing precise character meanings.
Each meaning was analyzed into its single senses, and its relationships to other members of the same synonym group were examined and compared. That is, the denotation, connotation, and range of application of each sense were carefully studied in contrast with those of their near-synonym counterparts, with emphasis on how the single senses of word-forming elements are influenced not only by normal syntagmatic relations, but also by often subtle semantic/functional distinctions dependent on the morphophonemic context. For example, whereas the Chinese-derived (*on*) bound morpheme 謡 yoo means 'popular song' in such compounds as 民謡 minyoo 'folk song', the native Japanese (*kun*) form 謡 utai refers to the chanting of a noh text.
Although every phase of the compilation and editing of NJECD was computerized, we faced great difficulties in the initial stages. MS-DOS and database management systems were not yet in widespread use, and the level of PC technology was hardly up to the task. Nevertheless, the lack of funds and technical expertise led us to select Fujitsu's FACOM-9450 series, the most advanced PC on the market at the time, rather than mini-computers.
To compile, process, and proofread the data for NJECD, we wrote about 700 programs in BASIC and used spreadsheets and other software packages from the mid-eighties, and had to resort to a series of ingenious tricks to force the hardware and software to perform tasks they were not designed for. An inevitable consequence of this was data files of complex structure, quite unlike the logically organized relational database files of today.
To produce KIT applications in a short period with maximum efficiency, it was essential to integrate state-of-the-art computer technology with such disciplines as computational lexicography and lexical semantics to restructure the data into a rationally-organized database system (DESK), and to write software for developing applications drawing data from the database. The work of building the database and application development is outlined below.
The character set of the computers used to compile NJECD, Fujitsu's now obsolete FACOM-9450 series, supported only Level 1 characters of JIS C 6226-1978. Since hundreds of characters were missing from the latter, we were forced to customize it by creating hundreds of user-defined characters and remapping hundreds of JIS Level 2 characters to JIS Level 1 codes. This resulted in a character set basically incompatible with current character set standards, national or corporate.
To ensure easy portability to a wide range of hardware and software platforms, we converted the data to the Shift-JIS code system and updated it to JIS X 0208-1990. In addition, we restored the remapped codes and either recreated or remapped user-defined characters not present in JIS X 0208-1990, if necessary by mapping into the supplemental character set JIS X 0212-1990, or the ISO 10646/Unicode character set, in that order. This approach, although complex, yielded excellent results by keeping user-defined characters to a bare minimum and ensuring maximum portability. It was suggested by Ken Lunde, an expert on Japanese encoding methods, who has written a definitive work on the subject (Lunde 1993).
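The fallback order described above (JIS X 0208 first, then JIS X 0212, then ISO 10646/Unicode) can be sketched in Python. This is only an illustration, not the conversion software used for the project. It assumes that Python's built-in `shift_jis` codec stands in for the JIS X 0208 repertoire (the codec also accepts the single-byte JIS X 0201 characters) and that the `euc_jp` codec marks JIS X 0212 characters with the lead byte 0x8F. The function name `target_charset` is hypothetical.

```python
def target_charset(ch: str) -> str:
    """Name the first character set, in fallback order, that contains ch."""
    # Step 1: try JIS X 0208, approximated here by the Shift-JIS codec.
    try:
        ch.encode("shift_jis")
        return "JIS X 0208"
    except UnicodeEncodeError:
        pass

    # Step 2: try the supplemental set JIS X 0212. In EUC-JP a character
    # from this set is encoded as three bytes beginning with 0x8F.
    try:
        if ch.encode("euc_jp")[:1] == b"\x8f":
            return "JIS X 0212"
    except UnicodeEncodeError:
        pass

    # Step 3: fall back to ISO 10646/Unicode.
    return "Unicode"


def classify(text: str) -> dict:
    """Map each distinct character in text to its target character set."""
    return {ch: target_charset(ch) for ch in dict.fromkeys(text)}
```

Running `classify` over the full list of entry characters would show how many of them still need treatment beyond JIS X 0208, which is the quantity the approach tries to keep to a minimum.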
Each entry character is associated with numerous attributes, such as a core meaning, various readings, multiple senses for each reading, and stylistic labels, and is also a member of various cross-reference networks. For example, 暖 and 温 share the *kun* reading *atatakai* but have slightly different connotations when used as free morphemes. On the other hand, 煖 and 暖 share the same meanings and *on* reading *dan* as word elements, e.g. as a verb 'to warm', but the free form 煖かい *atatakai* 'warm' is not normally used.
The entry characters and their attributes thus form an inherently complex network of semantic, orthographic, and phonological relations and subrelations, often interrelated in highly complex hierarchical structures that do not lend themselves easily to representation by traditional one-to-many and many-to-many relations. Expressing such intricate interrelations in a manner conducive to their effective extraction and analysis approaches the limits of relational databases and would ideally call for a network database design. Doing so within the limits of RDB systems requires a thorough analysis aimed at discovering the most effective constructs that will, on the one hand, capture and represent the relations between entry characters, compounds, and their respective attributes and, on the other, allow easy manipulation of the data with a view to efficiently generating a wide range of applications.
In spite of these limitations, we have chosen to adopt dBASE IV, a relational database management system, for a number of reasons, especially its universal availability, ease of manipulating data and developing applications using the Xbase language, and easy portability to other systems. We are also using PERL, a powerful language for text processing and string manipulation.
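As a rough illustration of how shared readings can be held in relational form, the sketch below models the *atatakai*/*dan* example as a many-to-many relation. It uses Python's standard `sqlite3` module in place of dBASE IV, and the table names, column names, and helper functions are all hypothetical; the actual DESK schema is not described here.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding three characters and two readings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE kanji (ch TEXT PRIMARY KEY);
        CREATE TABLE reading (
            id     INTEGER PRIMARY KEY,
            kind   TEXT NOT NULL CHECK (kind IN ('on', 'kun')),
            romaji TEXT NOT NULL,
            UNIQUE (kind, romaji)
        );
        -- Junction table: one character has many readings, and one
        -- reading belongs to many characters.
        CREATE TABLE kanji_reading (
            ch         TEXT    NOT NULL REFERENCES kanji(ch),
            reading_id INTEGER NOT NULL REFERENCES reading(id),
            PRIMARY KEY (ch, reading_id)
        );
        """
    )
    conn.executemany("INSERT INTO kanji VALUES (?)", [("暖",), ("温",), ("煖",)])
    conn.executemany(
        "INSERT INTO reading (id, kind, romaji) VALUES (?, ?, ?)",
        [(1, "kun", "atatakai"), (2, "on", "dan")],
    )
    conn.executemany(
        "INSERT INTO kanji_reading VALUES (?, ?)",
        [("暖", 1), ("温", 1), ("暖", 2), ("煖", 2)],
    )
    return conn


def characters_sharing(conn, kind, romaji):
    """List the characters that carry the given reading, in code-point order."""
    rows = conn.execute(
        """
        SELECT kr.ch
        FROM kanji_reading AS kr
        JOIN reading AS r ON r.id = kr.reading_id
        WHERE r.kind = ? AND r.romaji = ?
        ORDER BY kr.ch
        """,
        (kind, romaji),
    )
    return [ch for (ch,) in rows]
```

A plain junction table handles shared readings well enough. The harder cases mentioned above, such as senses that depend on the reading and on the morphophonemic context, would need further tables keyed on the character and reading pair.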
The DESK database contains (or will contain) detailed information on every important aspect of Chinese characters as used in CJK languages and the principal Chinese dialects. This includes printed and calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms and homophones, character etymology (based on Halpern 1987) and a wealth of other reference data.
The development of software for building the DESK database and the feeding of data to the system is being implemented in six stages.
The development and compilation of KIT applications and products is being carried out in three stages:
The production of KIT printed products is being carried out in four stages:
Lexicography is not yet a recognized discipline in Japan. By building a comprehensive CJK database and using it for compiling numerous lexicographic works, this project will make a significant contribution to the advancement and eventual establishment of lexicography as a branch of learning in Japan, and to the promotion of the study and research of CJK languages.
Below is a list of the principal dictionaries, reference works and learning tools (DESK applications) that could be compiled on the basis of the DESK database. (The asterisk indicates that more detailed information is available for that item.)
KUSUO HITOMI
    President of Showa Women's University
    Director General and President of KDPS
    Chairman of KDPS Editorial Committee
OKI HAYASHI
    President of the Society for Teaching Japanese as a Foreign Language
    formerly President of the National Language Research Institute
    Consultant to KDPS Editorial Committee
OSAMU MIZUTANI
    Director General of the National Language Research Institute
    Councilor of the Society for Teaching Japanese as a Foreign Language
    Consultant to KDPS Editorial Committee
SHIGEHIKO TOYAMA
    Professor at the Graduate School of Literature, Showa Women's University
    Member of KDPS Editorial Committee
TAKASHI TAKAMIZAWA
    Professor/Director of the Course of Japanese Literature, Showa Women's University
    Member of KDPS Editorial Committee
CHIKASADA HARADA
    Professor of Japanese Literature, Showa Women's University
    Member of KDPS Editorial Committee
TOMOKO KANEKO
    Professor of English and American Literature, Showa Women's University
    Member of KDPS Editorial Committee
KEN LUNDE
    Project Manager of Japanese Font Production at Adobe Systems, Inc.
    Technical Consultant to KDPS
YOSHIAKI TAKEBE
    formerly Professor at Waseda University
    Member of KDPS Editorial Committee
MASAAKI NOMURA
    Professor of Japanese at Center for Japanese Language, Waseda University
    Member of KDPS Editorial Committee
ATSUSHI FUKADA
    Assistant Professor of Applied Linguistics at Center for Linguistic and Cultural Research, Nagoya University
    Member of KDPS Editorial Committee
YOICHIRO YAMAMURA
    President of Brain Brigade Systems, Ltd.
    Production and Marketing Consultant to KDPS
JACK HALPERN
    Research Fellow at Institute of Modern Culture, Showa Women's University
    Editor in Chief of New Japanese-English Character Dictionary
    Editor in Chief of Kanji Integrated Tools Project
Listed below are the principal features of DESK-KIT applications and products. The presence or absence of a specific feature depends on the item in question. For more information, see the individual descriptions for each project (available on request) and Features of This Dictionary on page 61 of NJECD.