BUILDING A COMPREHENSIVE CHINESE CHARACTER DATABASE

Summary of paper presented at Euralex '94, an international congress for lexicographers held at the Free University Amsterdam from August 30 to September 3, 1994, plus appendixes for reference.

Showa Women's University, Institute of Modern Culture
KANJI DICTIONARY PUBLISHING SOCIETY
漢英字典刊行会
1-3-502 3-Chome Niiza
Niiza-shi, Saitama 352 JAPAN
PHONE: +81-48-481-3103 FAX: +81-48-479-1323

JACK HALPERN
Research Fellow at Institute of Modern Culture
Editor in Chief of Kanji Integrated Tools Project
Editor in Chief of New Japanese-English Character Dictionary

MASAAKI NOMURA
Professor of Japanese
Center for Japanese Language,
Waseda University

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics
Center for Linguistic and Cultural Research,
Nagoya University

Index
A B S T R A C T
1. BACKGROUND
2. PROJECT AIMS
3. PROJECT OUTLINE
4. LEXICAL SEMANTICS AND COMBINATORICS
5. COMPUTATIONAL LEXICOGRAPHY
6. DATA AND CODE CONVERSION
7. SYSTEM ANALYSIS AND DATABASE DESIGN
8. DEVELOPMENT OF DATABASE SYSTEM
9. DEVELOPMENT OF KIT APPLICATIONS
REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C

A B S T R A C T

The New Japanese-English Character Dictionary (NJECD) was designed to provide an in-depth understanding of how kanji are used in contemporary Japanese. One aim of this project is to use NJECD to build a comprehensive database with detailed information on how Chinese characters are used in Chinese, Japanese and Korean, including printed/calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms, homophones, and voluminous reference data. A second aim is to use this database to compile about forty applications and spinoff products for pedagogical and research purposes, including learner's dictionaries, reference manuals, and CALL software, by integrating lexical semantics and combinatorics with computational lexicography.

1. BACKGROUND

Although Japanese has been the subject of various linguistic studies, little attention has been given to the systematic analysis of its writing system. Kanji (Chinese characters as used in Japanese) are combined with each other to generate countless compound words, and function as a network of interrelated parts. Though this is vaguely recognized by educators, it has been largely disregarded in the compilation of character dictionaries. The demand for effective tools for mastering the Japanese script has been growing at an unprecedented pace. Learners are in urgent need of dictionaries that systematically address the special problems of non-Japanese students.

The New Japanese-English Character Dictionary (NJECD) (Halpern 1990, 1993) was compiled with the aim of creating a lookup tool that provides an in-depth understanding of the meanings and functions of high-frequency characters in contemporary Japanese. The dictionary departs from traditional kanji lexicography in several ways: (1) the *core meaning* defines the dominant character sense; (2) detailed meanings show how single-character morphemes generate numerous compounds; (3) psychologistic ordering reveals the logical/hierarchical interrelatedness between senses; (4) the System of Kanji Indexing by Patterns (SKIP), a new method for rapid retrieval of entries; and (5) precise distinctions between synonyms, homophones, and orthographic variants (for further details, see Halpern 1990, EURALEX '90 Proceedings).

2. PROJECT AIMS

This project aims to contribute to Sino-Japanese studies in general, and to Japanese language studies in particular, in the following four areas:

To use NJECD as a basis for creating a comprehensive kanji information database system, which will be referred to as DESK (Database System for Kanji). This database contains detailed information on the use of Chinese characters in Chinese, Japanese and Korean (CJK languages).
To use DESK as a basis for compiling about forty applications and spinoff products for pedagogical and research purposes.
To provide a comprehensive source of reference data on Chinese characters for pedagogical, linguistic and lexicological research. Some of these data will be made available on the Internet, with certain restrictions to avoid copyright violations.
To promote basic research on computational lexicography by establishing methodology for building integrated dictionary databases, especially multilingual databases for storing lexicographic data in a CJK environment.

3. PROJECT OUTLINE

To achieve these aims, the Kanji Dictionary Publishing Society was established in late 1993 as a part of the Institute of Modern Culture at Showa Women's University. The Society is directed by the Editorial Committee, which consists of renowned experts in Japanese linguistics, and is financed by the University and various foundations (1994 budget about US$250,000).

The DESK database is being used for compiling about forty computer-edited applications and spinoff products, including teaching and learning aids such as learner's dictionaries and reference manuals, foreign-language editions such as a German edition of NJECD, and software packages such as CAI/CAL courseware, electronic books and learning machines. This series of products will be referred to as KIT, which stands for Kanji Integrated Tools.

During the initial phase of the project, which will be completed in mid-1994, the framework and principal components of DESK will be created, and the electronic book (EB) edition of NJECD will be published. Concurrently, a pilot system for a pocket edition of NJECD is being built, and will also be completed in mid-1994.

The following KIT applications will be either published or finalized for publication over a period of two to three years:

New Japanese-English Character Dictionary: Electronic Book Edition
New Kanji-English Pocket Dictionary
New Kanji-English Learner's Dictionary
Kanji Input System Based on System of Kanji Indexing by Patterns
Comparative Study of Sino-Japanese Lexical Items
Kanji Cards
Japanese-English Dictionary of Kanji Synonyms
Japanese-English Dictionary of Kanji Usage

The EB edition of NJECD is scheduled for publication in the summer of 1994 in time for presentation at Euralex '94. This is the first kanji-English dictionary based on CD-ROM technology. It incorporates all the features of NJECD, including core meanings, independent words, homophone/synonym discrimination, compounds, radicals, a kanji thesaurus, and much more. A hierarchical menu system enables the user to easily retrieve information by specifying single or multiple keywords, such as readings, radicals, core meanings, SKIP patterns and stroke counts, in normal or word-end searches. This, combined with a comprehensive cross-reference network, provides the user with multiple search paths to access information with maximum speed and facility.
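
The kind of multiple-keyword retrieval described above can be illustrated with a minimal Python sketch. It is not the EB edition's actual software: the `Entry` record, the `search` function and the three abbreviated sample entries are hypothetical, and a word-end search is assumed here to mean matching the end of a reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, heavily abbreviated record for one entry character."""
    glyph: str
    readings: tuple   # romanized readings (only a few per character)
    radical: str
    strokes: int
    skip: str         # SKIP pattern code


# Illustrative sample data only; real entries carry far more attributes.
ENTRIES = [
    Entry("暖", ("dan", "atatakai"), "日", 13, "1-4-9"),
    Entry("温", ("on", "atatakai"), "水", 12, "1-3-9"),
    Entry("謡", ("yoo", "utai"), "言", 16, "1-7-9"),
]


def search(entries, reading=None, word_end=False,
           radical=None, strokes=None, skip=None):
    """Return the glyphs of entries matching every criterion that is given."""
    def reading_ok(entry):
        if reading is None:
            return True
        if word_end:
            # Assumed meaning of a word-end search: suffix match on a reading.
            return any(r.endswith(reading) for r in entry.readings)
        return reading in entry.readings

    return [e.glyph for e in entries
            if reading_ok(e)
            and (radical is None or e.radical == radical)
            and (strokes is None or e.strokes == strokes)
            and (skip is None or e.skip == skip)]
```

Combining keys narrows the result: `search(ENTRIES, reading="atatakai")` matches both 暖 and 温, while adding `strokes=13` leaves only 暖.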

4. LEXICAL SEMANTICS AND COMBINATORICS

The principal semantic component of DESK was compiled by submitting single-character morphemes to an exhaustive semantic analysis. The meanings were analyzed by such techniques as componential analysis and an in-depth examination of the differences and similarities between near-synonyms, which served as a powerful technique for establishing precise character meanings.

Each meaning was analyzed into its single senses, and its relationships to other members of the same synonym group were examined and compared. That is, the denotation, connotation, and range of application of each sense were carefully studied in contrast with those of their near-synonym counterparts, with emphasis on how the single senses of word-forming elements are influenced not only by normal syntagmatic relations, but also by often subtle semantic/functional distinctions dependent on the morphophonemic context. For example, whereas the Chinese-derived (*on*) bound morpheme 謡 yoo means 'popular song' in such compounds as 民謡 minyoo 'folk song', the native Japanese (*kun*) form 謡 utai refers to the chanting of a noh text.

5. COMPUTATIONAL LEXICOGRAPHY

Although every phase of the compilation and editing of NJECD was computerized, we faced great difficulties in the initial stages. MS-DOS and database management systems were not yet in widespread use, and the level of PC technology was hardly up to the task. Nevertheless, the lack of funds and technical expertise led us to select Fujitsu's FACOM-9450 series, the most advanced PC on the market at the time, rather than mini-computers.

To compile, process, and proofread the data for NJECD, we wrote about 700 programs in BASIC and used spreadsheets and other software packages from the mid-eighties, and had to resort to a series of ingenious tricks to force the hardware and software to perform tasks they were not designed for. An inevitable consequence of this was data files of complex structure, quite unlike the logically organized relational database files of today.

To produce KIT applications in a short period with maximum efficiency, it was essential to integrate state-of-the-art computer technology with such disciplines as computational lexicography and lexical semantics to restructure the data into a rationally-organized database system (DESK), and to write software for developing applications drawing data from the database. The work of building the database and application development is outlined below.

6. DATA AND CODE CONVERSION

The character set of the computers used to compile NJECD, Fujitsu's now obsolete FACOM-9450 series, supported only Level 1 characters of JIS C 6226-1978. Since hundreds of characters were missing from the latter, we were forced to customize it by creating hundreds of user-defined characters and remapping hundreds of JIS Level 2 characters to JIS Level 1 codes. This resulted in a character set basically incompatible with current character set standards, national or corporate.

To ensure easy portability to a wide range of hardware and software platforms, we converted the data to the Shift-JIS code system and updated it to JIS X 0208-1990. In addition, we restored the remapped codes and either recreated or remapped user-defined characters not present in JIS X 0208-1990, if necessary by mapping into the supplemental character set JIS X 0212-1990, or the ISO 10646/Unicode character set, in that order. This approach, although complex, yielded excellent results by keeping user-defined characters to a bare minimum and ensuring maximum portability. It was suggested by Ken Lunde, an expert on Japanese encoding methods, who has written a definitive work on the subject (Lunde 1993).
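
The order of preference described above (JIS X 0208 first, then JIS X 0212, then Unicode) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the conversion software used in the project: the function name `classify` is hypothetical, and the test relies on Python's built-in `shift_jis` and `euc_jp` codecs, whose tables may differ in detail from the 1990 editions of the standards.

```python
def classify(ch: str) -> str:
    """Name the first character set, in the preferred order, that contains ch."""
    # Step 1: JIS X 0208. In Shift-JIS, a two-byte sequence is a JIS X 0208
    # character (one-byte sequences are ASCII or half-width katakana).
    try:
        if len(ch.encode("shift_jis")) == 2:
            return "JIS X 0208"
    except UnicodeEncodeError:
        pass

    # Step 2: JIS X 0212. In EUC-JP, supplementary characters are encoded
    # as three bytes introduced by the single-shift byte 0x8F.
    try:
        encoded = ch.encode("euc_jp")
        if len(encoded) == 3 and encoded[0] == 0x8F:
            return "JIS X 0212"
    except UnicodeEncodeError:
        pass

    # Step 3: last resort, identify the character by its Unicode code point.
    return "Unicode U+%04X" % ord(ch)
```

For instance, `classify("漢")` reports JIS X 0208, whereas a character found only in the supplemental set, such as 丂, is reported as JIS X 0212, and a Hangul syllable falls through to Unicode.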

7. SYSTEM ANALYSIS AND DATABASE DESIGN

Each entry character is associated with numerous attributes, such as a core meaning, various readings, multiple senses for each reading, and stylistic labels, and is also a member of various cross-reference networks. For example, 暖 and 温 share the *kun* reading *atatakai* but have slightly different connotations when used as free morphemes. On the other hand, 煖 and 暖 share the same meanings and *on* reading *dan* as word elements, e.g. as a verb 'to warm', but the free form 煖かい *atatakai* 'warm' is not normally used.

The entry characters and their attributes thus form an inherently complex network of semantic, orthographic and phonologic relations and subrelations, often interrelated in highly complex hierarchical structures that do not easily lend themselves to representation by traditional one-to-many and many-to-many relations. Expressing such intricate interrelations in a manner conducive to their effective extraction and analysis approaches the limit of relational databases, and would ideally require a network database design. To do so within the limits of RDB systems requires a thorough analysis aimed at discovering the most effective constructs that will, on the one hand, capture and represent the relations between entry characters, compounds, and their respective attributes, and, on the other, allow easy manipulation of the data with a view to efficiently generating a wide range of applications.

In spite of these limitations, we have chosen to adopt dBASE IV, a relational database management system, for a number of reasons, especially its universal availability, ease of manipulating data and developing applications using the Xbase language, and easy portability to other systems. We are also using PERL, a powerful language for text processing and string manipulation.
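
As a rough illustration of how such relations might be expressed relationally, the following Python sketch stores the 暖/温/煖 example from this section in two tables and extracts the groups of characters that share a reading. It is not the DESK design: it uses SQLite in place of dBASE IV, and the table names, column names and helper functions are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create an in-memory database holding the example readings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE character (
            char_id INTEGER PRIMARY KEY,
            glyph   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE reading (
            reading_id INTEGER PRIMARY KEY,
            char_id    INTEGER NOT NULL REFERENCES character(char_id),
            kind       TEXT NOT NULL CHECK (kind IN ('on', 'kun')),
            romaji     TEXT NOT NULL
        );
    """)
    # One character has many readings (one-to-many).
    rows = [("暖", "kun", "atatakai"), ("温", "kun", "atatakai"),
            ("暖", "on", "dan"), ("煖", "on", "dan")]
    for glyph, kind, romaji in rows:
        conn.execute("INSERT OR IGNORE INTO character (glyph) VALUES (?)",
                     (glyph,))
        conn.execute(
            "INSERT INTO reading (char_id, kind, romaji) "
            "SELECT char_id, ?, ? FROM character WHERE glyph = ?",
            (kind, romaji, glyph))
    return conn


def shared_readings(conn):
    """Map (kind, romaji) to the sorted glyphs that share that reading."""
    groups = defaultdict(list)
    query = ("SELECT r.kind, r.romaji, c.glyph FROM reading AS r "
             "JOIN character AS c ON c.char_id = r.char_id")
    for kind, romaji, glyph in conn.execute(query):
        groups[(kind, romaji)].append(glyph)
    # Keep only readings shared by more than one character.
    return {key: sorted(glyphs)
            for key, glyphs in groups.items() if len(glyphs) > 1}
```

Calling `shared_readings(build_demo_db())` yields one group for the *kun* reading *atatakai* (暖, 温) and one for the *on* reading *dan* (暖, 煖). Attributes such as senses per reading or usage restrictions on free forms would need further tables, which is where the modeling difficulty noted above begins.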

8. DEVELOPMENT OF DATABASE SYSTEM

The DESK database contains (or will contain) detailed information on every important aspect of Chinese characters as used in CJK languages and the principal Chinese dialects. This includes printed and calligraphic forms, in-depth semantics, phonemics, encoding methods, indexing schemes, synonyms and homophones, character etymology (based on Halpern 1987) and a wealth of other reference data.

The development of software for building the DESK database and the feeding of data to the system is being implemented in six stages.

Developing software for restructuring the old format of NJECD's data to a rationally-structured relational database system on a dBASE platform.
Defining structures and developing software for building a system that is (a) sufficiently flexible to integrate the NJECD database into the broader framework of a comprehensive CJK database system (DESK) and (b) sufficiently open-ended to accommodate large-scale expansion.
Developing software and a menu-driven user interface for querying, searching, sorting, and otherwise manipulating the database system.
Thorough testing, revision, and maintenance of the system.
Building a pilot system for generating data for the New Kanji-English Pocket Dictionary in order to verify that the system is sufficiently robust to cope with dictionary compilation under field conditions.
Feeding large volumes of data to the database from various sources, including NJECD and its German edition, character meanings, compounds and their equivalents, frequency statistics, CJK character readings, character codes, calligraphic styles, etymology, stroke-order diagrams, etc. The system will grow organically through the addition of data from new sources, the compilation of new dictionaries, and the expansion of existing ones.

9. DEVELOPMENT OF KIT APPLICATIONS

The development and compilation of KIT applications and products is being carried out in three stages:

designing the system for each application by (a) performing an in-depth analysis of its special features, such as the range of coverage, ordering scheme, entry layout, appendixes and indexes, and by (b) drawing up software specifications for each application.
building a system for each application by developing application-specific software.
thorough testing, revision, and maintenance of software.

The production of KIT printed products is being carried out in four stages:

adding new data (such as German core meanings)
editing the data generated by each application-specific system, and repeatedly checking the data until it is error-free
developing software to process the data prior to computerized photocomposition
preparing camera-ready mechanicals by DTP and/or computerized photocomposition, to be followed by printing and binding.

NOTE

Lexicography is not yet a recognized discipline in Japan. By building a comprehensive CJK database and using it for compiling numerous lexicographic works, this project will make a significant contribution to the advancement and eventual establishment of lexicography as a branch of learning in Japan, and to the promotion of the study and research of CJK languages.

REFERENCES

HALPERN, Jack (1987): 漢字の再発見 (Kanji no Saihakken) 'Rediscovering Chinese Characters'. Tokyo: Shodensha
HALPERN, Jack (1990): New Japanese-English Character Dictionary. Tokyo: Kenkyusha
HALPERN, Jack (1990): New Japanese-English Character Dictionary: A Semantic Approach to Kanji Lexicography. EURALEX '90 Proceedings: Actas del IV Congreso Internacional, 157-166. Benalmadena (Malaga): Bibliograph
HALPERN, Jack (1993): NTC's New Japanese-English Character Dictionary. Chicago: National Textbook Company
LUNDE, Ken (1993): Understanding Japanese Information Processing. Sebastopol, CA: O'Reilly & Associates

APPENDIX A: LIST OF KIT APPLICATIONS

1. GENERAL CHARACTER DICTIONARIES 一般漢英字典

Below is a list of the principal dictionaries, reference works and learning tools (DESK applications) that could be compiled on the basis of the DESK database. (The asterisk indicates that more detailed information is available for that item.)

* NTC's New Japanese-English Character Dictionary (NTC, 1993)
* New Kanji-English Pocket Dictionary 新漢英小字典
* New Kanji-English Learner's Dictionary 新漢英学習字典
* Japanese-English Dictionary of Kanji Synonyms 類義漢字和英辞典
Pocket Kanji Thesaurus 類義漢字和英小辞典
* Japanese-English Dictionary of Kanji Usage 同訓使い分け和英辞典
Japanese-English Kanji Compounds Dictionary 実用漢英熟語字典・一般編
* New Japanese-German Character Dictionary 新漢独字典
New Japanese-Spanish Character Dictionary 新漢西字典
New Japanese-French Character Dictionary 新漢仏字典

2. SPECIAL-PURPOSE DICTIONARIES/REFERENCE WORKS 特殊漢字字典・参考書

Introduction to Kanji 漢字入門
* Kanji-English Dictionary for Business and Economics 実用漢英熟語字典・経済編
Kanji-English Dictionary for the Arts and Humanities 実用漢英熟語字典・文化編
Kanji-English Dictionary for Science and Technology 実用漢英熟語字典・科学技術編
Introduction to Kanji Compound Formation 漢字熟語成立ち入門
Japanese-English Dictionary of Prefixes and Suffixes 漢字接辞和英辞典
Japanese-English Dictionary for Counters and Units 単位・助数詞和英辞典
Kanji Reference Handbook 漢英参考情報便覧
Japanese-English Dictionary of Character Etymology 漢英字源字典
Introduction to the Radical System 漢字部首入門
Introduction to Written Japanese 日本語書き方入門
* Comparative Study of Sino-Japanese Lexical Items 漢語語彙比較研究

3. ELECTRONIC DICTIONARIES, OTHERS 電子字典・その他

Kanji Learner's Electronic Dictionary 電子漢字学習機
Kanji Learner's Courseware 漢字学習コースウェア
* Kanji Input System Based on System of Kanji Indexing by Patterns 字型検字法による漢字入力方式
Kanji Games Software Kit 漢字学習ゲームソフト
JIS Kanji Index Based on System of Kanji Indexing by Patterns 字型検字法によるＪＩＳ漢字索引
* New Japanese-English Character Dictionary: Electronic Book Edition 新漢英字典電子ブック版
New Japanese-English Character Dictionary: CD-ROM Edition 新漢英字典ＣＤ－ＲＯＭ版
Kanji Learner's Wall Chart 漢字学習貼紙表
* Kanji Cards 漢字学習カード
Introduction to Kanji: Video Edition 漢字学習ビデオ
Train and Subway Kanji Guide 電車・列車漢字案内
Restaurant Kanji Guide レストラン漢字案内

4. DICTIONARIES AND AIDS FOR JAPANESE USERS 日本人対象の字典・教材

Dictionary of Kanji Synonyms 類義漢字辞典
Pocket Kanji Thesaurus 類義漢字小辞典
Dictionary of Kanji Usage 同訓使い分け辞典
Kanji Learner's Dictionary for Elementary Schoolchildren 小学生用漢字学習字典
Dictionary of Kanji Compound Formation 漢字熟語構成辞典
Kanji Learner's Courseware 漢字学習コースウェア
Kanji Learner's Dictionary: Electronic Book Edition 漢字学習字典電子ブック版
Introduction to Kanji Compound Formation 漢字熟語成立ち入門
Kanji Learner's Graded Wall Chart 学年別漢字学習貼紙表

APPENDIX B: EDITORIAL COMMITTEE OF KANJI DICTIONARY PUBLISHING SOCIETY

KUSUO HITOMI
President of Showa Women's University
Director General and President of KDPS
Chairman of KDPS Editorial Committee

OKI HAYASHI
President of the Society for Teaching Japanese as a Foreign Language
formerly President of the National Language Research Institute
Consultant to KDPS Editorial Committee

OSAMU MIZUTANI
Director General of the National Language Research Institute
Councilor of the Society for Teaching Japanese as a Foreign Language
Consultant to KDPS Editorial Committee

SHIGEHIKO TOYAMA
Professor at the Graduate School of Literature, Showa Women's University
Member of KDPS Editorial Committee

TAKASHI TAKAMIZAWA
Professor/Director of the Course of Japanese Literature, Showa Women's University
Member of KDPS Editorial Committee

CHIKASADA HARADA
Professor of Japanese Literature, Showa Women's University
Member of KDPS Editorial Committee

TOMOKO KANEKO
Professor of English and American Literature, Showa Women's University
Member of KDPS Editorial Committee

KEN LUNDE
Project Manager of Japanese Font Production at Adobe Systems, Inc.
Technical Consultant to KDPS

YOSHIAKI TAKEBE
formerly Professor at Waseda University
Member of KDPS Editorial Committee

MASAAKI NOMURA
Professor of Japanese at Center for Japanese Language, Waseda University
Member of KDPS Editorial Committee

ATSUSHI FUKADA
Assistant Professor of Applied Linguistics at Center for Linguistic and Cultural Research, Nagoya University
Member of KDPS Editorial Committee

YOICHIRO YAMAMURA
President of Brain Brigade Systems, Ltd.
Production and Marketing Consultant to KDPS

JACK HALPERN
Research Fellow at Institute of Modern Culture, Showa Women's University
Editor in Chief of New Japanese-English Character Dictionary
Editor in Chief of Kanji Integrated Tools Project

APPENDIX C: OVERVIEW OF PRINCIPAL FEATURES

Listed below are the principal features of DESK-KIT applications and products. The presence or absence of a specific feature depends on the item in question. For more information, see the individual descriptions for each project (available on request) and "Features of This Dictionary" on page 61 of NJECD.

Core meaning -- a concise keyword that defines the most dominant sense of each character to provide an instant grasp of its fundamental concept.
Psychologistic ordering of character meanings, clustered around the core meaning in a manner that allows them to be conceived as a logically-structured, integrated unit.
Complete and accurate character meanings clearly show how a few thousand building blocks are combined to generate countless compound words.
Numerous high-frequency compounds provide maximally useful examples of each character sense and clearly show how these contribute to the meaning of each compound.
Compound formation articles describe the etymology of compounds and explain how their constituent characters contribute to their meanings.
Synonym articles provide full guidance on the differences and similarities between closely related characters.
Detailed usage notes help you understand the fine distinctions between kun homophones.
System of Kanji Indexing by Patterns -- a totally new method for looking up characters as quickly as in alphabetical dictionaries.
Six lookup methods and three indexes allow even a complete beginner to locate entries with great speed and little effort.
A system of labels provides useful information on the temporal status, etymology, orthography, style, function, level of formality, etc., of character senses.
The degree of importance of each character sense is indicated by various typographical differences and status labels for four levels of study.
Quick access to a valuable source of supplementary reference data, such as the principles of stroke order, frequency lists, historical tables, rules for okurigana, kana charts, and a list of kanji synonyms.
A user-friendly format ensures a visually attractive layout and maximum ease of use.