From chk@sol.kaist.ac.kr Thu Dec  5 19:11:44 1991
Received: from DAIDUK.KAIST.AC.KR by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA08970; Thu, 5 Dec 91 19:11:44 +0100
Received: from sol.kaist.ac.kr by daiduk.kaist.ac.kr (4.0/DAIDUK-0.1)
	id AA05139; Fri, 6 Dec 91 03:10:13 KST
Return-Path: <chk@sol.kaist.ac.kr>
Received: by sol.kaist.ac.kr (4.12/08-14-86)
	id AA05308; Fri, 6 Dec 91 03:09:20 KST
From: chk@sol.kaist.ac.kr (Chang Hyeoungkyu)
Message-Id: <9112051809.AA05308@sol.kaist.ac.kr>
Subject: Locale specific data manipulation
To: i18n@dkuug.dk
Date: Fri, 6 Dec 91 3:09:18 JST
Cc: unicode@sun.com
X-Mailer: ELM [version 2.3 PL0]
X-Charset: ASCII
X-Char-Esc: 29

* The subject was originally
*
*	"ko_KR.unicode@phonetic_sort" and
*	"phonetic sort".
*
* Now I changed it to
*
*	"Locale specific data manipulation"

Dear members,

This is a somewhat long note, but if you are interested in the above
issue, please read it and send me your comments. I really need your
help, and I greatly appreciate the replies I have already received.

-- 
Chang Hyeoungkyu  -  chk@sol.kaist.ac.kr
GUI Consortium / SA Lab., CS Dept., KAIST, KOREA

1. History

The term "locale" is defined as the combination of language, cultural
data and coded character set in Internationalization Guide (XPG4),
page 5.

I thought that data manipulation would not be neutral with respect to
language, cultural data, and coded character set. I wanted to define a
locale as a combination of language, cultural data, coded character
set, and their processing (manipulation).

I tried to find appropriate examples. Input and output may be such
cases, but it was better to find an example from the POSIX-defined
locale categories that cannot be expressed in the current "localedef".

The first example was:

	In "ko_KR.unicode@phonetic_sort" locale, string
	collation function should know Hanja-to-Hangul
	conversion rule, or should connect to such conversion
	server. (function pluggability)

But this was a wrong example, as (Ken Whistler) writes:

	What you may not be aware of is that the merged ISO
	10646/Unicode will also contain a compatibility area which
	includes codes for all the KSC5601 multiple code values for
	Hanja--in addition to the main unified Han character space.

	This ensures that any application which uses KSC5601 can
	guarantee a one-to-one mapping into and out of ISO
	10646/Unicode code values. Thus, if a Korean application
	depends on the multiple code values for Hanja with different
	pronunciations, a means exists for preserving those values in
	ISO 10646/Unicode without specification of a particular input
	conversion mechanism.

In spite of this incorrectness, I have received some encouraging
replies. Thanks to Tom McFarland, Walt Daniels, Ken Whistler,
John Entenmann, Malcolm Lithgow, Hiroko Kagawa, and Keld Jørn Simonsen.

That is the motivation for writing this second note. :)

2. Requirement

2.1 Collating Arabic characters

	Source of information:

		Digital Guide to Developing International Software,
		Digital Press, 1991.

	Words are sorted in code order with the Arabic vowel
	characters excluded. Groups of words having the same
	consonants are then sorted in code order including the vowel
	characters.

	Personally, I don't think that this kind of information can
	be expressed in the current "localedef".

2.2 Collating Japanese by phonetic sequences

	As (Malcolm Lithgow) writes:

	The Japanese use the Kanji in such a way that each character
	has a number of readings, sometimes more than twenty. The
	appropriate reading depends on the word that the character is
	in, and its position in that word. Sometimes it also depends
	on the position that the word is in a sentence. In summary,
	Japanese needs 'function-pluggability' for full usability, but
	don't expect to see anyone using it for quite a few years.

	As (Hiroko Kagawa) writes:

	In Japan, most ideographic characters have several phonetic
	readings, depending on the context of the characters. In the
	case of a telephone directory, they define a representative
	phonetic reading per character and sort by it.

	My opinion:

	It is not possible to express Ideographic-to-Phonetic
	conversion information in the current "localedef".

	We have two choices. The first is to have locale-specific
	functions. That is to say, if we call strcoll(), it should
	call strcoll_ja_JP@phonetic_sort().

	The second choice is that we extend the "localedef" to include
	this kind of information.

3. Design Alternatives

3.1 Wrapper function

	To do locale specific data manipulation, the functions defined
	to operate in the internationalized environment become the
	wrappers to locale specific functions.

	As (John Entenmann) writes:

	What exactly is "function pluggability" ? As described above
	it seems to be the ability to supply locale specific routines
	(as opposed to tables).

	Locale specific routines can be provided under the current
	global scheme as well. The LC_* category can supply functions
	instead of or in addition to tables. For example, strcoll
	could look to LC_COLLATE for a sorting routine which would be
	specific to each locale.

	As (Tom McFarland) writes:

	I understand the concern and agree with your conclusions to
	the extent that we need to support function pluggability in
	the locale management and m_* function model. I believe the
	papers as written currently do support this model. For
	example, the locale management paper describes a locale object
	as an opaque structure that contains all the data, and
	optionally methods, necessary to process data in that locale.

	I envision locale objects being implemented to include method
	pointers for languages that require them. However, I don't
	believe that we can require any vendor to implement it this
	way. So I believe it is appropriate to ensure that the spec
	allows methods and leave it to the vendor to understand when
	methods are necessary.

	As (Chang Hyeoungkyu) writes:

	Mr. John Entenmann and Mr. Tom McFarland say that the current
	global or non-global locale model can solve the problem of
	locale specific routines. That is true for the model itself.

	Then what is the role of localedef ? In a localedef source we
	have only a description of locale data. What about methods ?
	We can only say that once the foo locale is set up, the foo
	locale specific routines can be used.

	It seems to me that we should throw away "localedef" and just
	define the interfaces of i18n routines, like the m_*
	functions. That, I think, is a true locale object.

	As (Tom McFarland) writes:

	Mr. Hyeoungkyu raised an interesting point in earlier mail.
	Though I don't have the mail handy, I believe he asked how the
	method would be specified in localedef(). At our last meeting,
	we decided to divorce the concepts of global locale model and
	non-global locale model, so it isn't necessary to modify the
	X/Open/POSIX definition of localedef. We've said that the
	content and structure of a locale object is vendor dependent,
	so as far as I can tell we don't need to define the locale
	object structure. However, we *do* need to figure out how to register
	an algorithm as part of a registered locale (Keld's paper).
	The purpose of the registry was to ensure that a given
	registered locale would perform the same I18N functions,
	producing identical results, on different platforms. Since we
	allow a locale object to contain methods, we have to figure
	out how to register it. Any ideas?

	As (Tom McFarland) writes:
	
	If those specific problems can be solved by tables... great.
	But I don't think we can force people to use table look-ups.
	One of the advantages of going to opaque objects is that
	procedure pointers can be part of the object as well as data
	structures. This has the advantage of being able to use a
	procedure, very specific to my locale, for some/all I18N
	functions and still allow an application/library to be
	internationalized.

	For example, from the system vendor's point of view, it may be
	more desirable in certain cases to write a simple procedure
	for a locale rather than try to make a single procedure (such
	as strcoll) generic for all languages; this allows my
	code to be small and efficient without having to fit into a
	general purpose architecture. It allows applications to be
	internationalized, since the procedure pointer is different
	(or non-existent) for different locales; the application
	doesn't know this - it simply creates a locale object and
	calls the m_* function with the object as an argument. The
	implementation of any m_* function can then check to see if
	the procedure pointer for that object is null; if it is,
	perform some default table lookup, otherwise, call the
	procedure using the pointer in the object.

	So to answer your original question... I don't have a specific
	problem I'm trying to solve. Rather, I'm looking at the
	architecture we defined and trying to cover all bases in how
	that architecture might be utilized. Part of the capability
	set we are providing with locale management and m_* is the
	ability to use locale specific methods to perform I18N
	functions. It is wholly reasonable to assume that some
	countries will want to register locales that include methods.
	(I know that if you were going to register private locales, HP
	would certainly want to include several. Methods will make
	our implementation cleaner and more efficient, and make our
	code space utilization smaller.)

	My opinion:

	In the Japanese case already described, one could retain the
	corresponding Kana of each Kanji at input conversion time (I
	saw this in a letter from Glenn Adams distributed to
	unicode@sun.com). Handling such tagged data would be easier
	if we had locale specific routines.

3.1.1 Problems of wrapper function approach

	- Vendor overhead.

		The national profiles used in POSIX can be provided by
		national bodies. But in this approach the vendor itself
		must support all the locales that contain methods.

	- User defined locales

		We may think that all the locales are supplied by a
		vendor; it is not easy to add a user defined locale in
		this approach.

		If a locale "foo" is added by a user (or an
		application vendor, not a platform vendor), programs
		that were compiled before it was added do not know the
		"foo" locale at all. Only programs compiled after the
		"foo" locale was added know about it.

3.2 Rule description in "localedef"

	The examples above are confined to the LC_COLLATE category.
	So, I propose another way.

	We can allow a "rule description language" for collation in
	the localedef source. It should be in a very restricted form,
	and our string collation functions should interpret the rule
	description.
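
As a purely hypothetical illustration, such a restricted rule notation
might sit inside the LC_COLLATE section of a localedef source. The
"transform" keyword below is invented for this sketch; it is not part
of POSIX localedef:

```
LC_COLLATE
# HYPOTHETICAL -- "transform" is not POSIX localedef syntax.
# pass 1: compare with the vowel class stripped;
# pass 2: compare the full string, vowels included.
transform pass1 strip-class "vowel"
transform pass2 identity
order_start forward;forward
# ... ordinary collating statements ...
order_end
END LC_COLLATE
```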

3.2.1 Problems

	This is just an idea; I have not thought it through deeply.
	Does it give us the real flexibility that we need ?

4. Comments

	I will appreciate any comments, whether they encourage or
	discourage me. Please help me by giving your insight into
	this issue. :)
