From domo@tsa.co.uk Tue Feb 12 11:08:31 1991
Received: from mcsun.EU.net by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA29073; Tue, 12 Feb 91 11:08:31 +0100
Received: from kestrel.ukc.ac.uk by mcsun.EU.net with SMTP;
	id AA15648 (5.65a/CWI-2.72); Tue, 12 Feb 91 11:11:30 +0100
Received: from slxsys by kestrel.Ukc.AC.UK   with UUCP  id aa04535;
          12 Feb 91 10:08 GMT
Received: from tsa.co.uk by specialix.co.uk
	id aa09126; Tue, 12 Feb 91 9:30:27 GMT
From: Dominic Dunlop <domo@tsa.co.uk>
Date: Tue, 12 Feb 91 09:16:25 GMT
Message-Id: <13962.9102120916@tsa.co.uk>
In-Reply-To: kyongsok kim <kkim@plains.nodak.edu>
       "SSC and Hangul" (Feb 11, 11:20)
X-Fax: +44 491 651751
X-Phone: +44 491 652590
X-Address: 9 The Forty, Cholsey, OXON OX10 9LH, U.K.
X-Organization: The Standard Answer Ltd.
X-Mailer: Mail User's Shell (7.1.2 7/11/90)
To: kkim@plains.nodak.edu, davis.mark@applelink.apple.com
Subject: Unicode, pattern matching and repertoires [Was Re: SSC and Hangul]
Cc: unicode@sun.com, i18n@dkuug.dk, uniforum-intl@sun.com,
        asmusf%microsoft@relay.eu.net
X-Charset: ASCII
X-Char-Esc: 29

[Asmus -- please treat this as a formal comment on Unicode 1.0.  I am
making it as an individual, not as a representative of any corporation
or organization.  Please acknowledge receipt.]

[From "SSC and Hangul" dated Feb 11]
> 
> > Let's consider a simple string search operation.  We want to find all
> > occurrences of /l/su/rog/ You will enter LSuRog as a search string
> > although you need to find lSuRog.
> 
Not sure that I understand the subtleties of this example, but anyway...
The internationalization extensions proposed to the traditional UNIX
regular expression mechanism by the draft POSIX 1003.2 shell and tools
standard may help: they allow the definition of arbitrary ``collating
symbols'' which can be called out in match strings and matched as single
units.  Does this solve the problem?  Does it even help?
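
To illustrate what "matched as single units" could mean, here is a
minimal sketch in Python.  It is not the POSIX 1003.2 interface: the
function names and the example symbol set ("ch" and "ll") are
assumptions made purely for illustration.

```python
def split_collating_elements(text, symbols):
    """Split text into collating elements.

    `symbols` is a hypothetical, locale-defined set of multi-character
    collating symbols.  The longest symbol that matches at each position
    wins; anything else is taken one character at a time.
    """
    ordered = sorted((s for s in symbols if s), key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for sym in ordered:
            if text.startswith(sym, i):
                elements.append(sym)
                i += len(sym)
                break
        else:
            # No multi-character symbol starts here: single character.
            elements.append(text[i])
            i += 1
    return elements


def find_elements(haystack, needle, symbols):
    """Return element positions where needle occurs, comparing whole elements."""
    h = split_collating_elements(haystack, symbols)
    n = split_collating_elements(needle, symbols)
    return [k for k in range(len(h) - len(n) + 1) if h[k:k + len(n)] == n]


symbols = {"ch", "ll"}                       # assumed example symbols
print(find_elements("chico", "c", symbols))  # [2]: only the lone "c" matches
print(find_elements("chico", "h", symbols))  # []: "h" exists only inside "ch"
```

Because the comparison is made element by element, a search for "c"
does not match the "c" inside the symbol "ch".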

> By "loose" matching criteria, I
> guess you want to ignore spacing or non-spacing characteristics of
> characters.

You can't ignore this issue.  The proposed POSIX mechanism can handle
non-spacing diacritics.  Essentially, the combination of a spacing
character and one or more diacritical characters is defined to be a
single collating symbol.  The current glaring holes in Unicode in
respect of such things are:

1.  Unicode does not define a repertoire.  Actually, I don't think this
    is a problem: those who define sets of collating symbols for -- say
    -- national or corporate use will, in effect, be defining a private
    repertoire.

2.  Unicode defines no conventions for applying non-spacing diacritics
    to spacing characters: 
    
    -- it does not state which spacing characters are valid targets for
       diacritical characters;

    -- it does not adequately define the ordering of spacing and
       non-spacing characters intended to fill a single display
       position.  (While the diacritical characters are defined as
       following the spacing character in the data stream, the ordering
       of multiple diacritical marks ``from the center of the base
       character outward'' (Unicode 1.0 Final Review Document, page 3)
       seems to me to leave scope for multiple interpretations.
       Ordering by the encoding of the diacritical mark would displace
       ``visual'' considerations with an arbitrary programming
       convention, but it would be unambiguous.);

    -- it does not prohibit multiple applications of the same
       diacritical character to a single spacing character;
       
    -- it does not prohibit the application of diacritical marks to a
       base character where the result would have the same appearance
       as a character to which a single Unicode cell has already been
       allocated.

Unless something is done about the second point, everybody will be able
to define their own repertoires -- which is fine -- but they'll all
disagree on the representation of particular characters -- which will
result in anarchy.  Worse, the developers of word-processors and such
will not know which representation of particular characters they should
use in the files created by their products, and consequently will each
go their own way, creating truly horrendous data interchange problems.
This would be a tremendously retrograde step.
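
To make the second point concrete, the following minimal sketch shows
how differently encoded strings for the same visible character fail to
compare equal, and how ordering diacritical marks by their encoding
would give one unambiguous form.  It uses the `unicodedata` module of
present-day Python to recognise non-spacing marks, which is an
assumption of the sketch and not something Unicode 1.0 supplies.  The
function name is hypothetical.

```python
import unicodedata


def order_marks_by_encoding(text):
    """Sort each run of combining marks by code value.

    A character counts as a diacritical mark here if
    unicodedata.combining() reports a non-zero class for it (an
    assumption of this sketch).  The base character stays in front.
    """
    out = []
    marks = []
    for ch in text:
        if unicodedata.combining(ch):
            marks.append(ch)
        else:
            out.extend(sorted(marks))   # flush marks of the previous base
            marks = []
            out.append(ch)
    out.extend(sorted(marks))
    return "".join(out)


acute_then_dot = "e\u0301\u0323"   # e, combining acute, combining dot below
dot_then_acute = "e\u0323\u0301"   # same marks in the other order

# Two orderings of the same marks are different strings...
print(acute_then_dot == dot_then_acute)                  # False
# ...until both are put into the arbitrary but unambiguous order.
print(order_marks_by_encoding(acute_then_dot)
      == order_marks_by_encoding(dot_then_acute))        # True
# A single-cell character and its base-plus-mark spelling also differ.
print("\u00e9" == "e\u0301")                             # False
```

The sketch deals only with the ordering of marks.  It does nothing
about repeated marks, or about combinations that duplicate a character
which already has a cell of its own, as the last comparison shows.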

-- 
Dominic Dunlop
