From ALB%SEAS@liverpool.ac.uk Thu Jan 23 22:21:12 1992
Message-Id: <9201232120.AA19667@danpost2.uni-c.dk>
Addressed-To:   I18N@DK.DKUUG Via MAILER
Addressed-From: ALAIN_LA_BONTE (Alain LaBonte O1 418 644 1835)
Forwarding:     Contents of another mailfile...
Subject:        I thought this could be of interest here.
Date:           Thu, 23 Jan 1992  19:03 GMT
To: I18N@DKUUG.DK
From: ALB <ALB%SEAS@liverpool.ac.uk>
X-Charset: ASCII
X-Char-Esc: 29


----------------------   Forwarded Mail Follows   ----------------------

To:      SC22WG20@DK.DKUUG Via MAILER
From:    ALAIN_LA_BONTE (Alain LaBonte  O1 418 644 1835)
Subject: Symbolism inside and outside programs
Date:    Thu, 23 Jan 1992  18:53 GMT

David Joslin writes:

:(d)  const herald = '$';   (ALB's note: $ represents the BEL character)
:
:     ...................
:
:     if ch = herald then ....
:
:This also makes the program clearer to read.

Of course, this is standard programming technique. But "herald" is not clearer
to me (the example you chose is a good one: I know it is the name of an English
newspaper, but I had to look in a dictionary to see what it really meant, don't
laugh (-:). That's OK, because an Englishman will maintain the program, and it
is OK as long as I am not precluded from using mnemonics that are meaningful in
my mother tongue. So inside a private program this name is fine, but a naming
convention used outside programs must be more universal: it needs to be
efficient (short, since it is for computers as well as for humans) and
natural-language-independent.

:(f)  const e_acute = chr(233);  {ISO 8859-1}
:
:     .........................
:
:     if ch = e_acute then ....

Comparisons are not always that simple, though; in fact they are almost always
more complex. The results of comparisons must also be consistent with ordering,
and a result can be that the first string is before, after, or approximately
equal to the second ("HERALD"<->"herald"). SHARE Inc. (USA) even proposed to
SHARE EUROPE (which had requested such functionality) to add
EQUALITY-EXCEPT-SPECIALS ("vice versa"="vice-versa"), EQUALITY-EXCEPT-CASE
("C>OT/E"<->"c>ot/e") and EQUALITY-EXCEPT-DIACRITICS ("C>OT/E"<->"cote"). We
found the idea a good generalisation of the STRUCTURED ALPHABETIC DATA TYPE we
proposed (in fact ordering is also done that way in POSIX), so that, by analogy,
you can structure alpha data the way floating point is structured:

Floating point: <Sign><Exponent><Mantissa><Mantissa><Mantissa>
Alpha String:   <Script Base><Diacritics><Case><Specials>

Wherever you truncate, working from right to left, you lose only precision,
not the essence of the number (or of the string's message).
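
The analogy above can be sketched in code. This is only an illustration under
stated assumptions, not the SHARE proposal or the Qu/ebec implementation: the
names structured_key and equal_at are hypothetical, "specials" is taken to mean
anything that is neither a letter, a digit nor an accent, and Unicode
decomposition (Python's unicodedata module) stands in for whatever tables a
real system would use. "C>OT/E" is read here as the ASCII rendering of CÔTÉ.

```python
import unicodedata


def structured_key(text):
    """Split text into (base, diacritics, case, specials).

    Hypothetical layout mirroring <Script Base><Diacritics><Case><Specials>.
    """
    base = []        # part 1: unaccented, lower-cased letters and digits
    diacritics = []  # part 2: (index of base character, combining mark)
    case = []        # part 3: "U" or "L" for each base character
    specials = []    # part 4: (index in base where it occurs, character)
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # An accent belongs to the base character just before it.
            diacritics.append((len(base) - 1, ch))
        elif ch.isalnum():
            case.append("U" if ch.isupper() else "L")
            base.append(ch.lower())
        else:
            # Assumption: spaces, hyphens, apostrophes etc. are "specials".
            specials.append((len(base), ch))
    return ("".join(base), tuple(diacritics), "".join(case), tuple(specials))


def equal_at(a, b, parts):
    """Compare only the first `parts` parts (1..4) of the two keys."""
    return structured_key(a)[:parts] == structured_key(b)[:parts]
```

With this layout, equal_at(a, b, 3) behaves like EQUALITY-EXCEPT-SPECIALS,
equal_at(a, b, 2) like EQUALITY-EXCEPT-CASE, and equal_at(a, b, 1) like
EQUALITY-EXCEPT-DIACRITICS (which then also ignores case and specials), while
equal_at(a, b, 4) is exact equality.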

So symbolism must be done in a more sophisticated way, that is not only at the
data level but at the process and data structure level. In fact comparison
operations should be done using such a structure, with sophisticated
functionality given at the language level. Ordering is just a special
operation also using comparisons at its lowest layer. But many more operations
are necessary and need consistency. SHARE EUROPE also defined FUZZY-EQUALITY
based on phonetic rules (which depend much more on the language).

In the Qu/ebec Government we have developed all these functions, and we are
about to modify our data bases so that character string processing becomes
consistent in this way (for names and addresses of customers and so on, but it
could also be used very efficiently in word processing). The only function we
do not deal with is fuzzy equality (some departments have routines that handle
French phonetics, but so far it is too inefficient to make that part of
standard systems for huge data bases where other languages are also involved).

By the way, even in English, problems arise in data bases today: if
"O'BRIEN" is written that way in a data base and you search for, say,
"O' Brien", there is no way to retrieve the information in the typical
case. The tools have to be reinvented by each dp craftsman, and consistent
handshaking between systems is impossible to maintain in these conditions.
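
A minimal sketch of how a lookup keyed on the base part alone would avoid the
O'BRIEN problem. Everything here is an assumption made for illustration: the
names base_key and NameIndex are hypothetical, an in-memory dictionary stands
in for a real data base, and Python's unicodedata module is used to strip
accents.

```python
import unicodedata
from collections import defaultdict


def base_key(text):
    """Reduce text to unaccented, lower-cased letters and digits."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch.lower()
        for ch in decomposed
        # Drop accents (combining marks) and specials (spaces, apostrophes...).
        if ch.isalnum() and not unicodedata.combining(ch)
    )


class NameIndex:
    """Toy index that stores names as written but searches by base key."""

    def __init__(self):
        self._by_base = defaultdict(list)

    def add(self, record_id, name):
        # The original spelling is kept; only the lookup key is normalized.
        self._by_base[base_key(name)].append((record_id, name))

    def find(self, query):
        # Returns every record whose base key matches that of the query.
        return list(self._by_base.get(base_key(query), []))
```

A record added as "O'BRIEN" is then found by a search for "O' Brien", since
both reduce to the same base key. The spelling as written stays available for
finer comparison when more precision is wanted.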

Take my own name: in a few data bases it is written "LA BONTE", in
others "LA  BONTE" (2 spaces), or "LABONTE", or "LaBONTE", or "La Bont/e"
and so on. In our normalized data bases, these cases would be:

    <labonte>  <acute on the last e>  <Upper Lower Lower>  <Space in pos. 3>

as an example for the last case (the structure is efficiently coded in binary;
I just show the meaning here). All the other cases share the first part.

Interesting, isn't it? In 10 years, ISO will understand, maybe before.
In the meantime, we do it, and I have heard that a few shops in the U.S. did it
too (based on our model), just to deal with upper case, lower case and specials
(and it was no more difficult to deal with accents, so they did that as well).
When will this be integrated into languages, so that we can deal with characters
as easily as with floating point numbers? (The definitions are there, and the
algorithms are known and have been implemented very efficiently.)

                   Alain LaBont/e
                   Minist<ere des Communications du Qu/ebec

c.c. WG14, I18N
Btw I have also dreamed of an analogous scheme for Chinese. Simplified characters
could be normalised, for example, in the first part, and discriminated for
more precision, in the second part. In this way comparisons would be simpler.
For all other scripts, so far the same kind of structure is applicable, but
the number of parts may be more than 4 (Arabic requires 6 parts). The
POSIX LOCALEs deal with such a structure in the LC_COLLATE definitions.
The Canadian standard CSA Z243.4.1 defines the structure and the Default
Ordering Standard for the Universal Character Set (DIS 10646) will make use
of the same kind of structure, possibly adding a higher order level, the
script identifier.
