From greger@iuk Wed Dec  5 20:29:38 1990
Received: from [128.212.16.14] by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA10567; Wed, 5 Dec 90 20:29:38 +0100
Received: by ism.isc.com (Sendmail5.61/1.35)
	id AA07368; Wed, 5 Dec 90 11:33:02 -0800
Received: from friherr 
	by iuk.isc.com (5.61/smail2.2/11-14-88)
	id AA13614; Wed, 5 Dec 90 18:39:25 GMT
Received: by  (5.61/1.35/jcb-s)
	id AA00626; Wed, 5 Dec 90 18:52:52 GMT
Date: Wed, 5 Dec 90 18:52:52 GMT
Message-Id: <9012051852.AA00626@>
To: Mark_E_Davis.PINKTEAM%gateway.qm.apple.com@ism,
        unicode%noddy.eng.sun.com@ism,
        Internet_UniCore.PINKLINK%gateway.qm.apple.com@ism, i18n%dkuug.dk@ism
From: greger@ism.isc.com ("greger@ism.isc.com (Greger Leijonhufvud, ISC, High Wycombe, U.K.)")
Subject: Re(2): 10646 Advantages
X-Charset: ASCII
X-Char-Esc: 29

In reply to your message of Fri Nov 16 20:03:28 1990
-------

The following is a comment/follow-up to Dominic Dunlop's message dated
Nov 16 1990.

I quote from Dominic's answer:

>   The net effect of this is that the ISO POSIX working group (!) is
>   currently running with the issue because it needs a solution: the
>   UNIX shell and tools embody collation and related concepts
>   (filename expansion and listing, the sort command, regular
>   expressions), and a corresponding international standard must be
>   internationally applicable.  Work in progress suggests that, by
>   making up to four passes backwards and forwards through text,
>   assigning different weights (including ``ignore'', ``high'' and
>   ``low'') to each encoded character encountered on each pass, you
>   can achieve useful real-world collation.  Although you probably
>   can't do a telephone book sort even in New York, never mind Tokyo.

>   Our work has been based primarily on encodings without the
>   non-spacing diacritics (accents) of Unicode.  If it turns out that
>   we can't accommodate these, we'll think again: the ability to
>   handle Unicode is at the very least an important proof of concept
>   for us.  (My feeling is that, compared to the handling of stateful
>   encodings with locking shifts -- something else that we intend to
>   accommodate -- non-spacing diacritics should be a piece of cake.)

The current collation scheme in POSIX.2 supports collation specifications
with n passes (typically n = 2 to 4); free assignment of weights
per pass, including IGNORE; multi-character collating elements; one-to-many
mappings; a different evaluation order per pass (forward or backward); and
equivalence classes. It supports the proposed Canadian standard collating
sequence, which is the most complex I have seen in official documents so
far. Certainly, it cannot do advanced telephone book collation (which is
often based on phonetics), but it will do a creditable job of dictionary
ordering for Western languages.
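
As a rough illustration of the multi-pass idea, here is a minimal sketch
in Python. It is not the POSIX.2 specification syntax; the weight table,
the names (IGNORE, DIRECTIONS, LETTERS, weights, sort_key) and the choice
of three passes are all assumptions made for the example. It builds one
weight list per pass, drops IGNOREd characters, and reverses the list for
a pass that is evaluated backward. Non-ASCII letters are written as
escapes to keep the text in ASCII.

```python
IGNORE = None  # marker: the character contributes nothing on this pass

# Assumed evaluation order per pass: base letters forward,
# accents backward (as in French dictionary ordering), case forward.
DIRECTIONS = ("forward", "backward", "forward")

BASE = {"c": 1, "e": 2, "o": 3, "t": 4}

# Lower-case letter -> (base letter, accent weight).
# Accent weights (assumed): none = 0, acute = 1, circumflex = 2.
LETTERS = {
    "c": ("c", 0),
    "e": ("e", 0),
    "\u00e9": ("e", 1),  # e with acute
    "o": ("o", 0),
    "\u00f4": ("o", 2),  # o with circumflex
    "t": ("t", 0),
}


def weights(ch):
    """Return the (pass 1, pass 2, pass 3) weights of one character."""
    if ch == "-":
        return (IGNORE, IGNORE, IGNORE)  # hyphen is ignored on every pass
    base, accent = LETTERS[ch.lower()]
    return (BASE[base], accent, 1 if ch.isupper() else 0)


def sort_key(text, directions=DIRECTIONS):
    """Build one tuple of weights per pass; earlier passes dominate."""
    per_char = [weights(ch) for ch in text]
    key = []
    for level, direction in enumerate(directions):
        level_weights = [w[level] for w in per_char if w[level] is not IGNORE]
        if direction == "backward":
            level_weights.reverse()
        key.append(tuple(level_weights))
    return tuple(key)


words = ["c\u00f4t\u00e9", "cote", "cot\u00e9", "c\u00f4te"]
print(sorted(words, key=sort_key))
print(sorted(words, key=lambda w: sort_key(w, ("forward",) * 3)))
```

With the backward accent pass, the word with only the circumflex sorts
before the word with only the acute; with all passes forward, the two
change places. That is the effect of choosing the evaluation order per
pass.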

In this context, the multi-character collating element feature is of
interest. It allows a sequence of characters to be specified so that it
collates as a single entity. One example is the Spanish 'ch', which collates
as an entity between 'c' and 'd'. Another, pertinent here, arises
with code sets that employ non-spacing characters (which we tend to
frown upon, for obvious reasons: they are difficult to handle in programs,
because the character width becomes variable). You define every character
that is created via a non-spacing character as a multi-character collating
element made up of the non-spacing character and the 'base character'
(e.g. 'A).
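
To make the mechanism concrete, here is a minimal Python sketch of
multi-character collating elements. The element list ORDER and the names
collating_elements and element_key are hypothetical, and the sketch uses a
single pass only: the text is split by longest match into collating
elements, and each element is replaced by its rank. The apostrophe stands
in for a non-spacing acute that precedes its base character, as in the 'A
example; placing 'a directly after a is a simplification for the example.

```python
# Assumed collating order; "ch" and "'a" are multi-character elements.
ORDER = ["a", "'a", "b", "c", "ch", "d", "e", "h", "i",
         "l", "m", "n", "o", "r", "s", "u"]
WEIGHT = {element: rank for rank, element in enumerate(ORDER)}
MAX_LEN = max(len(element) for element in ORDER)


def collating_elements(text):
    """Split text into collating elements, longest match first."""
    elements = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in WEIGHT:
                elements.append(candidate)
                i += length
                break
        else:
            raise ValueError(f"no collating element for {text[i]!r}")
    return elements


def element_key(text):
    """Sort key: the rank of each collating element, in order."""
    return [WEIGHT[element] for element in collating_elements(text)]


print(collating_elements("chico"))
print(collating_elements("'arbol"))
print(sorted(["dia", "chico", "cosa", "curso"], key=element_key))
```

Because 'ch' is matched as one element ranked between 'c' and 'd', "chico"
sorts after "curso" and before "dia", whereas a plain character-by-character
sort would put it first.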

Greger Leijonhufvud
INTERACTIVE Systems
High Wycombe, UK
greger@{iuk,ism}.isc.com
