A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.
Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than that of any other major language, Chinese included. A major source of complexity in processing Japanese texts is the presence of an extremely large number of homophones.
This article presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that treatment of homophones in Japanese texts deserves special attention in the realms of NLP, MT, IR and IME applications.
Here is an example of how complex the problem is. Take the phrase Hi no sasanai yashiki (A Mansion with no Sunshine), which could be the title of a novel or a film. Here are twelve legitimate ways (some more likely than others) to write it.
- 日の差さない屋敷
- 日の射さない屋敷
- 日のささない屋敷
- 日の射さない邸
- 日の差さない邸
- 日のささない邸
- 陽の射さない屋敷
- 陽の差さない屋敷
- 陽のささない屋敷
- 陽の射さない邸
- 陽の差さない邸
- 陽のささない邸
We surveyed six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly, we received six different answers, none of which matched the "standard" form found in dictionaries (the first variant above). Clearly, even native speakers of Japanese cannot be expected to know which specific variant is used in an official title.
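The twelve variants listed above are simply every combination of the alternative spellings for each segment of the phrase: two for hi, three for sasanai and two for yashiki. The following is a minimal sketch of that enumeration; the names `HI`, `SASANAI`, `YASHIKI` and `title_variants` are illustrative assumptions, not part of any existing system.

```python
from itertools import product

# Alternative spellings for each segment of "hi no sasanai yashiki",
# taken from the twelve variants listed in the text.
HI = ["日", "陽"]
SASANAI = ["差さない", "射さない", "ささない"]
YASHIKI = ["屋敷", "邸"]


def title_variants():
    """Return every combination of the segment spellings (2 x 3 x 2)."""
    return [
        f"{hi}の{sasanai}{yashiki}"
        for hi, sasanai, yashiki in product(HI, SASANAI, YASHIKI)
    ]


variants = title_variants()
print(len(variants))  # number of distinct ways to write the phrase
```

A search system that wants to match the title regardless of which variant the author chose would have to treat all of these strings as equivalent.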
An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese-derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.
On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only several of these are in common use. Unlike on homophones, most kun homophones are close or even identical in meaning and thus easily confused, as shown in the table below:
Easily Distinguished: hashi | | Easily Confused: noboru | |
---|---|---|---|
橋 | bridge | 上る | go up (steps, a hill) |
端 | end, edge | 登る | climb, scale |
箸 | chopsticks | 昇る | ascend, rise (up to the sky) |
Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.
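The sense-dependent interchangeability described above can be modeled by recording, for each sense, the set of written forms accepted for it. The sketch below covers only the two senses of tokeru mentioned in the text; the structure and the function name `interchangeable` are assumptions made for illustration.

```python
# Sense -> written forms accepted for that sense, for the reading "tokeru".
# Only the two senses discussed in the text are included.
TOKERU = {
    "melt, thaw": {"解ける", "溶ける"},
    "come loose": {"解ける"},
}


def interchangeable(senses, form_a, form_b, sense):
    """True if both forms are accepted spellings for the given sense."""
    accepted = senses.get(sense, set())
    return form_a in accepted and form_b in accepted


print(interchangeable(TOKERU, "解ける", "溶ける", "melt, thaw"))
print(interchangeable(TOKERU, "解ける", "溶ける", "come loose"))
```

Keying the data by sense rather than by character is what allows the same pair of forms to be interchangeable in one context and distinct in another.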
To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and identification more difficult.
By "homophone processing" we mean such operations as cross-homophone searching, homophone disambiguation in IME systems, and homophone identification in MT applications. Homophone processing requires a semantically classified database of homophones and a homophone expansion algorithm.
The process of retrieving or identifying Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. But there are factors that make the processing of Japanese homophones far more challenging than in any other language. From a text processing point of view, the major issue is that for many kun homophones, a universally accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences. For example, when a search engine user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.
No. | English | "Standard" Form | Sometimes also | Often also |
---|---|---|---|---|
1 | to offer | 差す | さす | |
2 | to hold up | 差す | さす | |
3 | to pour into | 差す | 注す | さす |
4 | to color | 差す | 注す | さす |
5 | to shine on | 差す | 射す | さす |
6 | to aim at | 指す | 差す | |
6 | to point to | 指す | さす | |
7 | to stab | 刺す | さす | |
8 | to leave unfinished | さす | 止す | |
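One way to see why a written form of sasu is ambiguous is to invert the table: instead of listing forms per sense, list the senses each form can stand for. The following sketch transcribes the table rows as given (without distinguishing the "standard", "sometimes" and "often" columns) and builds that inverted index; `SASU_SENSES` and `build_form_index` are hypothetical names.

```python
from collections import defaultdict

# Sense -> written forms, transcribed from the sasu table in the text.
SASU_SENSES = {
    "to offer": ["差す", "さす"],
    "to hold up": ["差す", "さす"],
    "to pour into": ["差す", "注す", "さす"],
    "to color": ["差す", "注す", "さす"],
    "to shine on": ["差す", "射す", "さす"],
    "to aim at": ["指す", "差す"],
    "to point to": ["指す", "さす"],
    "to stab": ["刺す", "さす"],
    "to leave unfinished": ["さす", "止す"],
}


def build_form_index(senses):
    """Invert sense -> forms into form -> list of senses it may express."""
    index = defaultdict(list)
    for sense, forms in senses.items():
        for form in forms:
            index[form].append(sense)
    return dict(index)


FORM_INDEX = build_form_index(SASU_SENSES)
for form, senses in FORM_INDEX.items():
    print(form, len(senses), senses)
```

In this data the hiragana form さす maps to nearly every sense, which illustrates how writing in hiragana makes identification harder, while a form such as 射す narrows the meaning to a single sense.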
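To sum up, Japanese homophones have certain characteristics that present difficulties in Japanese text processing: kun homophones are numerous and often close or identical in meaning, many of them lack a universally accepted orthography, and authors frequently write them in hiragana, which obscures the intended meaning.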
Implementing homophone processing technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
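The difference between retrieving all homophones and retrieving only semantically related ones can be sketched as follows. This is a toy illustration under stated assumptions, not the database or algorithm described here: the record layout, the group labels "return" and "change", and the inclusion of 還る as a second "return" form are all assumptions made for the example.

```python
# (reading, written form, semantic group) records.
# The group labels are hypothetical and exist only for this sketch.
HOMOPHONES = [
    ("kaeru", "帰る", "return"),
    ("kaeru", "還る", "return"),
    ("kaeru", "変える", "change"),
]


def expand_naive(form, records):
    """Return every form that shares a reading with the query form."""
    readings = {r for r, f, _ in records if f == form}
    return sorted({f for r, f, _ in records if r in readings})


def expand_semantic(form, records):
    """Return only forms sharing both reading and semantic group."""
    keys = {(r, g) for r, f, g in records if f == form}
    return sorted({f for r, f, g in records if (r, g) in keys})


print(expand_naive("帰る", HOMOPHONES))     # includes the unrelated 変える
print(expand_semantic("帰る", HOMOPHONES))  # restricted to the "return" group
```

The naive expansion pulls in 変える 'to change' for a query on 帰る 'to return', which is exactly the kind of irrelevant match a semantic classification is meant to prevent.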
JACK HALPERN 春遍雀來 (ハルペン・ジャック)
President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University
Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, won first prize in the International Speech Contest in Japanese, and is the founder of the International Unicycling Federation.
Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK).
The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.
The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:
- Dozens of lexicographic works, including electronic dictionaries.
- Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
- CJK input method editors (IME) and front-end processors (FEP).
- Machine translation, online translation tools and speech technology software.
- Pedagogical, linguistic and computational lexicography research.
DESK currently has over two million Japanese and about 2.5 million simplified and traditional Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.
The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.
Jack Halpern
The CJK Dictionary Institute, Inc. 日中韓辭典研究所
34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-42-587-3318
Email: jack@cjk.org
WWW: http://www.cjk.org