From erik@sran8.sra.co.jp Sun Nov 18 04:55:32 1990
Received: from mcsun.EU.net by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA09767; Sun, 18 Nov 90 04:55:32 +0100
Received: by mcsun.EU.net with SMTP; Sun, 18 Nov 90 04:58:50 +0100
Received: from srava.sra.co.jp by srawgw.sra.co.jp (5.64WH/1.4)
	id AA21533; Sun, 18 Nov 90 12:57:42 +0900
Received: from sran8.sra.co.jp by srava.sra.co.jp (5.64b/6.4J.6-BJW)
	id AA10873; Sun, 18 Nov 90 12:57:43 +0900
Received: from localhost by sran8.sra.co.jp (4.0/6.4J.6-SJ)
	id AA24323; Sun, 18 Nov 90 12:56:19 JST
Return-Path: <erik@sran8.sra.co.jp>
Message-Id: <9011180356.AA24323@sran8.sra.co.jp>
Reply-To: erik@sra.co.jp
From: Erik M. van der Poel <erik@sra.co.jp>
To: Becker.OSBU_North@xerox.com
Cc: unicode@sun.com, i18n@dkuug.dk, arnet@hpda.cup.hp.com,
        arnet@hpcupt1.cup.hp.com
Subject: Re: Han Character Code Ordering
Date: Sun, 18 Nov 90 12:56:17 +0900
Sender: erik@sran8.sra.co.jp
X-Charset: ASCII
X-Char-Esc: 29

> It might be mentioned that nearly all book-form Han character
> dictionaries in Taiwan, Japan, and Korea use a radical/stroke order;
> and ordering via radicals and stroke counts is in fact a part of every
> national encoding standard except KS C5601.  So any statement that
> this scheme is "foreign to Japanese eyes" is obviously false and must
> have resulted from some kind of misunderstanding.

Yes, it is true that Han character dictionaries in Japan (Kanji
dictionaries) are arranged in some form of radical and stroke order.
But it is also true that the Japanese rarely use these dictionaries:
they usually know how to pronounce the word they are looking up, so
they look it up in a dictionary sorted in Kana (phonetic) order.

Radical/stroke order dictionaries are a pain in the ass. You probably
won't hear a Japanese person say this, so I'll say it for them. :-)


> The "most common
> pronunciation" order is nice and familiar when it works, which it
> sometimes does.
> 
> Joe

Yes, the "most common pronunciation" order is nice when you are
ordering single characters. But to completely satisfy the ordinary
Japanese user, collation will have to be string-based, rather than
character-based. (Of course, as you say, most applications will cop
out and just do character-based sorting, in which case I think the
UniHan scheme is great.)

String-based sorting is desirable because of the change in
pronunciation of a character when it is combined with other
characters. Example:

	KAZE	(1 character)	means "wind"
	TAI FUU	(2 characters)	means "typhoon"

Here, KAZE and FUU are two readings of the same character. The implications of
this are staggering. Not only do we need a large dictionary with all
the different pronunciations, but we may in some cases also need to
parse sentences. But this should probably be left to sophisticated
applications.
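
To make the difference concrete, here is a minimal sketch in Python of
the two approaches applied to the KAZE / TAI FUU example. Everything
in it is an assumption for illustration: the two small tables, the
names char_key and string_key, the choice of "dai" as the most common
reading of the first character of TAI FUU, and the use of alphabetical
order on romanized readings as a stand-in for Kana order. A real
system would need the large dictionary mentioned above.

```python
# Characters are written as escapes so this text stays plain ASCII.
WIND = "\u98a8"      # read KAZE on its own, FUU in TAI FUU
PLATFORM = "\u53f0"  # first character of TAI FUU

# Hypothetical per-character "most common pronunciation" table.
CHAR_READING = {WIND: "kaze", PLATFORM: "dai"}

# Hypothetical word dictionary: the reading of each whole string.
WORD_READING = {WIND: "kaze", PLATFORM + WIND: "taifuu"}


def char_key(word):
    """Character-based key: one fixed reading per character."""
    return [CHAR_READING[ch] for ch in word]


def string_key(word):
    """String-based key: the reading of the word as a whole."""
    return WORD_READING[word]


words = [WIND, PLATFORM + WIND]

# Character-based: compares ["kaze"] with ["dai", "kaze"].
by_char = sorted(words, key=char_key)

# String-based: compares "kaze" with "taifuu".
by_string = sorted(words, key=string_key)
```

With these assumed tables the two keys put the two words in opposite
orders: the character-based key places TAI FUU first (under "dai"),
while the string-based key places KAZE first, as a phonetic dictionary
would.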

So what's the conclusion? As far as Unicode and collation are
concerned, UniHan is probably the way to go. ISO 10646 is somewhat at
a disadvantage in this respect. But 10646 has many other advantages
that far outweigh its disadvantages.


Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan      TEL +81-3-234-2692

