The Complexities of Japanese Homophones

Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社日中韓辭典研究所
Revised: June 15, 2001

Index to This Document
Some Definitions
Introduction
Overview of Japanese Homophony
Homophone Processing
About the Author
List of Publications
The CJK Dictionary Institute

1. Some Definitions

A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.

Homophone: One of two or more words that are pronounced the same but differ in writing and usually in meaning (e.g. principal and principle).
Homograph: One of two or more words that are written the same but differ in pronunciation and (usually) in meaning (misleadingly also called heteronyms) (e.g. minute "60 seconds" and minute "very small").
Homonym: One of two or more words that are identical in writing and/or pronunciation but differ in meaning (sometimes called homologues) (e.g. light "not heavy" and light "not dark").
Orthographic Variant: One of two or more words that are written differently but are identical in pronunciation and meaning (sometimes called heterographs) (e.g. judgement and judgment).

2. Introduction

Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than that of any other major language, Chinese included. A major source of complexity in processing Japanese texts is the presence of an extremely large number of homophones.

This article presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that treatment of homophones in Japanese texts deserves special attention in the realms of NLP, MT, IR and IME applications.

Here is an example of how complex the problem is. Let us take the phrase Hi no sasanai yashiki (A Mansion with no Sunshine), which could be the name of a novel or a film. Here are twelve legitimate ways (some more likely than others) to write it.

日の差さない屋敷
日の射さない屋敷
日のささない屋敷
日の射さない邸
日の差さない邸
日のささない邸
陽の射さない屋敷
陽の差さない屋敷
陽のささない屋敷
陽の射さない邸
陽の差さない邸
陽のささない邸

We surveyed six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly, we received six different answers, none of which matched the "standard" form found in dictionaries (#1 above). Clearly, even native speakers of Japanese cannot be expected to know which specific variant is used in the official title.
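
The twelve spellings above arise from three independent choices: two characters for hi, three spellings for sasa, and two for yashiki. As a minimal sketch (the variable and function names are illustrative and not part of any existing tool), the full set can be generated as a Cartesian product in Python:

```python
from itertools import product

# Alternative spellings for each segment of "hi no sasanai yashiki",
# taken from the twelve variants listed above.
HI = ["日", "陽"]
SASA = ["差さ", "射さ", "ささ"]
YASHIKI = ["屋敷", "邸"]


def spelling_variants():
    """Return every combination of the segment spellings."""
    return [
        f"{hi}の{sasa}ない{yashiki}"
        for hi, sasa, yashiki in product(HI, SASA, YASHIKI)
    ]


variants = spelling_variants()
print(len(variants))
for variant in variants:
    print(variant)
```

The order produced by the sketch differs from the order of the list above, but the set of spellings is the same. Adding one more alternative to any segment multiplies the total, which is why the number of legitimate spellings grows so quickly for longer phrases.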

3. Overview of Japanese Homophony

An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.

On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only a few of these are in common use. Unlike on homophones, most kun homophones are close or even identical in meaning and thus easily confused, as shown in the table below:

Kun Homophones
Easily Distinguished (hashi)		Easily Confused (noboru)
橋	bridge	上る	go up (steps, a hill)
端	end, edge	登る	climb, scale
箸	chopsticks	昇る	ascend, rise (up to the sky)

Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.

To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and identification more difficult.

4. Homophone Processing

By "homophone processing" we mean such operations as cross-homophone searching, homophone disambiguation in IME systems, and homophone identification in MT applications. Homophone processing requires a semantically classified database of homophones and a homophone expansion algorithm.

The process of retrieving or identifying Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. But there are factors that make the processing of Japanese homophones far more challenging than in any other language. From a text processing point of view, the major issue is that for many kun homophones, a universally accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in practice it is often unpredictable and governed by personal preference.

For example, when a search engine user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.

Kun Homophones for sasu
No.	English	"Standard" Form	Sometimes also	Often also
1	to offer	差す		さす
2	to hold up	差す		さす
3	to pour into	差す	注す	さす
4	to color	差す	注す	さす
5	to shine on	差す	射す	さす
6	to aim at	指す	差す
6	to point to	指す		さす
7	to stab	刺す		さす
8	to leave unfinished	さす	止す
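
The table can be read in two directions: from a sense to its written forms, and from a written form to the senses it may express. The following Python sketch transcribes the rows into a dictionary to show both lookups. The names SASU_FORMS, senses_of and expand are hypothetical, and this is only an illustration of the data relationships, not the homophone expansion algorithm mentioned above.

```python
# Sense -> written forms, transcribed from the sasu table above.
SASU_FORMS = {
    "to offer": ["差す", "さす"],
    "to hold up": ["差す", "さす"],
    "to pour into": ["差す", "注す", "さす"],
    "to color": ["差す", "注す", "さす"],
    "to shine on": ["差す", "射す", "さす"],
    "to aim at": ["指す", "差す"],
    "to point to": ["指す", "さす"],
    "to stab": ["刺す", "さす"],
    "to leave unfinished": ["さす", "止す"],
}


def senses_of(form):
    """List every sense that the given written form can express."""
    return [sense for sense, forms in SASU_FORMS.items() if form in forms]


def expand(form):
    """Collect every form that shares at least one sense with `form`."""
    related = set()
    for sense in senses_of(form):
        related.update(SASU_FORMS[sense])
    return related


print(senses_of("射す"))       # a narrow query: one sense only
print(sorted(expand("射す")))  # forms a search would also need to match
print(len(senses_of("さす")))  # the hiragana form is far more ambiguous
```

A query written with 射す maps to a single sense, whereas the hiragana form さす can stand for almost every sense in the table, which illustrates why falling back to hiragana makes identification harder.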

To sum up, Japanese homophones have certain characteristics that present difficulties in Japanese text processing:

Since many kun homophones are nearly synonymous or even identical in meaning, they are easily confused. As a result, there is no way to predict which particular homophone will appear in a text.
The distinction between some homophones is so subtle that many authors sidestep the irksome task of selecting the appropriate kanji and resort to hiragana.
Since Japanese has only a small stock of phonemes, the number of homophones is very large.

Implementing homophone processing technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
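
The difference between retrieving all homophones and retrieving only semantically related ones can be sketched in Python as follows. This is a toy illustration under stated assumptions: the HOMOPHONES dictionary and both function names are hypothetical, the entries 帰る and 変える come from the example above, and 還る is added only as an illustrative variant of 'to return'.

```python
# Hypothetical miniature database: reading -> sense -> written forms.
# 帰る and 変える are from the text; 還る is an illustrative addition.
HOMOPHONES = {
    "kaeru": {
        "to return": ["帰る", "還る"],
        "to change": ["変える"],
    },
}


def naive_expand(reading):
    """Return every form with this reading, regardless of meaning."""
    forms = []
    for group in HOMOPHONES.get(reading, {}).values():
        forms.extend(group)
    return forms


def semantic_expand(reading, form):
    """Return only the forms in the same sense group as `form`."""
    for group in HOMOPHONES.get(reading, {}).values():
        if form in group:
            return list(group)
    return [form]  # unknown form: fall back to the query itself


print(naive_expand("kaeru"))           # includes the irrelevant 変える
print(semantic_expand("kaeru", "帰る"))  # stays within 'to return'
```

The naive lookup returns 変える for a query on 帰る, while the sense-grouped lookup does not. A real system would need a far larger, carefully classified database; the sketch shows only why the classification matters.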

About the Author

JACK HALPERN 春遍雀來 (ハルペン・ジャック)

President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University

Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, a first-prize winner of the International Speech Contest in Japanese, and the founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK).

List of Publications

Following is a list of the author’s principal publications in the field of CJK lexicography.

Halpern, Jack (1982): “Linguistic Analysis of the Function of Kanji in Modern Japanese,” 27th International Conference of Orientalists in Tokyo.
Halpern, Jack (1985): “Function of Kanji in Modern Japanese,” Transactions of the International Conference of Orientalists in Japan. The Tōhō Gakkai (The Institute of Eastern Culture). 27th International Conference of Orientalists in Japan in Tokyo.
Halpern, Jack (1985): “Kenkyusha’s New Japanese-English Character Dictionary,” Calico Journal, December 1985.
Halpern, Jack (1987): 漢字の再発見 Kanji no Saihakken ‘Rediscovering Chinese Characters’. Tokyo: Shodensha.
Halpern, Jack (1990): New Japanese-English Character Dictionary (Sixth Printing). Tokyo: Kenkyusha.
Halpern, Jack (1990): “New Japanese-English Character Dictionary: A Semantic Approach to Kanji Lexicography,” Euralex ’90 Proceedings. Actas del IV Congreso Internacional, 157-166. Benalmádena (Málaga): Bibliograf.
Halpern, Jack (1993): NTC’s New Japanese-English Character Dictionary. Chicago: National Textbook Company.
Halpern, Jack, Nomura Masaaki, and Fukada Atsushi (1994): “Building a Comprehensive Chinese Character Database,” Euralex ’94 Proceedings. International Congress on Lexicography in Amsterdam.
Halpern, Jack (1995): New Japanese-English Character Dictionary, Electronic Book Edition. Tokyo: Nichigai Associates.
Halpern, Jack (1998): “Building A Comprehensive Database for the Compilation of Integrated Kanji Dictionaries and Tools,” 43rd International Conference of Orientalists in Tokyo.
Halpern, Jack (1999): The Kodansha Kanji Learner’s Dictionary. Tokyo: Kodansha International.
Halpern, Jack and Kerman, Jouni (1999): “The Pitfalls and Complexities of Chinese to Chinese Conversion,” Fourteenth International Unicode Conference in Cambridge, Massachusetts.
Halpern, Jack (2000): “The Challenges of Intelligent Japanese Searching,” Tokyo.
Halpern, Jack: Dictionary of Unified CJK Characters -- for the Unicode Standard. Forthcoming.

The CJK Dictionary Institute

The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

Dozens of lexicographic works, including electronic dictionaries.
Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
CJK input method editors (IME) and front-end processors (FEP).
Machine translation, online translation tools and speech technology software.
Pedagogical, linguistic and computational lexicography research.

DESK currently has over two million Japanese and about 2.5 million simplified and traditional Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.

President
Jack Halpern
The CJK Dictionary Institute, Inc.
日中韓辭典研究所

34-14, 2-chome, Tohoku, Niiza-shi
Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-42-587-3318
Email: jack@cjk.org
WWW: http://www.cjk.org