From ALB@SHARE-E.ORGN.UK Fri Dec 10 02:57:00 1993
Received: from share-e.orgn.uk by dkuug.dk with SMTP id AA07842
  (5.65c8/IDA-1.4.4j for <I18N@DKUUG.DK>); Fri, 10 Dec 1993 03:53:10 +0100
Message-Id: <199312100253.AA07842@dkuug.dk>
Received: from SHARE-E.ORGN.UK by SHARE-E.ORGN.UK (IBM VM SMTP R1.2.2MX) with BSMTP id 0077; Fri, 10 Dec 93 02:58:15 GMT
Received:     from SEAS
              by MAILER(4.1.z);  10 Dec 1993 02:58:11 GMT
Addressed-To: SC21@DKUUG.DK Via MAILER
Addressed-Cc: I18N@DKUUG.DK Via MAILER
Addressed-From: ALAIN_LA_BONTE (Alain LaBonte' +1 418 644 1835)
Forwarding: Contents of another mailfile...
Subject: LID Protest
Date:    Fri, 10 Dec 1993  02:57 GMT
To: SC21@dkuug.dk
Cc: I18N@dkuug.dk
From: ALB <ALB@SHARE-E.ORGN.UK>
X-Charset: ASCII
X-Char-Esc: 29


----------------------   Forwarded Mail Follows   ----------------------

To:      SC22@DKUUG.DK Via MAILER
cc:      COMP@KOMP.ACE.NL Via MAILER
Subject:  A new language independent datatype for i18n: serving humanity
From:    ALAIN_LA_BONTE (Alain LaBonte'  +1 418 644 1835)
Date:    Fri, 10 Dec 1993  02:47 GMT

Willem Wakker (SC22/WG11 convenor) writes, as a resolution of a
Canadian comment asking for the inclusion of a ProcessableString datatype
in the Language Independent Datatype (LID) standard (CD):

>After extensive deliberations (including an exchange of views with the
>proposers of the datatype) WG11 has decided NOT to include the
>ProcessedString datatype in CD 11404.

I think this is a serious mistake. It seems to me that the principle of
having a processed (or, better named, processable) string, which is a
programming container for the machine-processable binary string equivalent
to the semantics of a character string in a given natural language, is a
must for getting out of the mess we are in with character string
compare/sort/search. My only wish is that those who are against it will have
a more open mind if somebody comes back with it in the DIS ballot.

>The reasons for this decision can be formulated as follows:
>
>a)  Incomplete specification of the requested datatype.
>
>    The following documents were mentioned as base documents for the
>    required datatype:
>
>    - SHARE - Europe: White Paper on National Language Architecture (1990)
>    - CSA Standard Z243.4.1-1992, Canadian Alphanumeric Ordering Standard
>     for Character Sets.
>
>    A detailed analysis of these documents showed the following:
>
>    1. The SHARE document describing the "functionality" is aimed at
>       rendering foreign language characters into EBCDIC (and ISO 646) in
>       such a way that transliterations can be supported, collating sequences
>       can be defined, and orthographic conventions can be selectively used
>       or ignored in string comparisons (and presumably other processing).
>       It has examples, but does not DEFINE a complete form for ANY language.
>       This document is a technical report, identifying the things which must
>       be considered, and a consistent approach, in developing some elements
>       of what have become POSIX LOCALE files.

For collation (in other words, compare/sort/search operations) it describes
functions that start from national (why do you say *FOREIGN*?) language
specs (and now we have the POSIX LOCALEs as a mechanism to specify the tables)
and transform a string into a processing key (formed of 4 subkeys) that is
directly processable (comparable, with results allowing different levels of
equivalence plus correct ordering) by engines which know how to compare binary
data, whether the engine is a hardware process or a programming function. It
also describes the reverse function: out of the processable binary strings it
can restore the original character data (in any code, in passing).
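
To illustrate the transformation described above, here is a minimal sketch
in Python. It is not the SHARE Europe or CSA algorithm itself: the function
name processing_key, the choice of what goes in each of the 4 subkeys, and
the use of Unicode decomposition to find accents are all assumptions made
for illustration. Python tuples stand in for the binary subkeys. The only
rule borrowed from common practice is that accents are compared from the end
of the word backwards, as is usually described for French ordering.

```python
import unicodedata

def processing_key(text):
    """Return a 4-subkey tuple for `text` (simplified, hypothetical rules)."""
    base, accents, case, specials = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the letter it follows.
            if accents:
                accents[-1] += (ord(ch),)
        elif ch.isalnum():
            base.append(ch.lower())           # level 1: letters only
            accents.append(())                # level 2: accents per letter
            case.append(ch.isupper())         # level 3: lowercase first
        else:
            specials.append((len(base), ch))  # level 4: position + character
    # Assumption: accents are weighed from the end of the word backwards.
    return (tuple(base), tuple(reversed(accents)), tuple(case), tuple(specials))

# "cote" with its circumflex and acute variants, written with escapes.
words = ["c\u00f4t\u00e9", "cote", "cot\u00e9", "c\u00f4te"]
ordered = sorted(words, key=processing_key)
```

Because the first subkey ignores accents and case, comparing only that
subkey gives the loosest level of equivalence, while comparing all four
gives the full ordering.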

The request for a processable string datatype is about a container for
something which will be internally defined as a black-box result (according to
discussions I had with the editor). When I talked about the similarity between
what SHARE Europe designed and floating point, an argument used was that
the LID std is not concerned with internal representation and that the
target (generic) datatypes for numbers are REAL and INTEGER. Why should it
be different with the processable string datatype? There is something I don't
understand in the argument used in the negative answer.

>    2. The cited Canadian standard defines the complete information needed for
>       the development of a ProcessedString datatype for French and English.
>       Additional facilities, however, would be needed to accommodate other
>       European languages, such as German, Swedish, Spanish and Dutch, while a
>       significantly different set of facilities would be needed for Russian,
>       Greek, Hebrew and Arabic.  And it is not clear that the SHARE approach
>       can be extended to Chinese and Japanese at all.  Thus, the cited
>       references do not readily lead to the formulation of a datatype which
>       extends to languages outside of Western Europe.

Correction: the Canadian standard specifically states (and this was checked
by the IBM National Language Technical Centre and documented as the IBM
reference for these languages) that it supports at once not only English but
also German, Dutch, Italian and Portuguese, not excluding others. An addendum
has been prepared to support all characters of the Latin script and many other
languages. Scandinavian languages have now been proven to be supportable by
exactly the same method: the Scandinavians have developed POSIX LOCALEs (i.e.
the same engine as the Canadian std with *slightly* different tables) which do
not require additional facilities. IBM has done the same for these languages
plus Spanish, Russian, Greek, Hebrew, Arabic and other languages.

It has been demonstrated that even if additional facilities were eventually
required, whenever an order is possible it is possible to arrive at a
straightforward binary string which the ProcessableString datatype could
represent for programming languages, given n levels. n is equal to 4 in most
languages; IBM documents Arabic with 5, but recent discussions suggest it
could also be handled in 4. In any case it would be nice to have access to a
variable number of levels, or at a minimum to a single string if the LID std
does not want to deal with a variable number of levels.

Maybe WG11 members did not find it clear that the method is applicable to
other languages, but it is. That includes Chinese, which is the easiest to
support, and Japanese: the transformation of kanji into kana is already done
in Japanese systems [kana is phonetic and as straightforward to describe in a
locale as the Latin script], and kanji is used to break ties when the
phonetics are equal. All this can be reduced to a binary string of n levels
too, with n equal to 3 in this case (plus a null 4th level if desired).
Chinese [if phonetic ordering, i.e. pinyin, is chosen as opposed to pure Hanzi
ordering] can be handled in the same way. Indic scripts are more complicated
but can also be reduced to a processable string of n levels (in a fashion very
similar to Arabic at the ordering level).
>    WG11 has therefore come to the conclusion that if LID is to contain
>    such a datatype, it must define the components thereof in such a way as
>    to be useful to more than a few National Bodies.  But, extension to
>    other languages (ideally to all languages covered by ISO 10646) requires
>    additional work, which should be the responsibility of WG20.

Extension to other languages (including Japanese) has indeed begun in WG20,
and an ordering standard is in preparation which uses the same engine
as the Canadian [in other words, now the POSIX LOCALEs] model.
The ProcessableString, if the argument used in the discussions was not
merely meant to discourage proponents, is already defined well enough to be
usable by all member bodies. No proof to the contrary has ever been given, so
it is quite surprising to hear an opinion which is not based on solid ground.

>b)  Insufficient justification for inclusion.
>
>    Although it is recognized that `something like the ProcessedString
>    datatype' could be useful for certain fields of application (like the
>    POSIX Locale mechanism), this in itself is not enough justification
>    for inclusion of the datatype in the LID standard.

Well then, for whom is the LID standard? For a handful of maniac programmers?
This is, with all due respect, intellectual terrorism.

>
>    Moreover, the existence of ISO 10646 (which postdates the SHARE
>    document) guarantees that there is an international standard character
>    set, and therefore an LID Character-String datatype, in which all of the
>    information contained in the ProcessedString values can be exchanged.
>    While there is value added in processing the string into "textual
>    components", it appears that that processing is dependent on the natural
>    language and possibly the application.

It does not merely appear so... It is dependent on the language, and it can
be adapted through variants required by a given application, which would be
the strength of such a datatype. The specs come from outside the application;
what more do you want? I18N is built, and has to be built, on the philosophy
that the comparison engine provides the application with something that can
meet the needs of different cultural environments without the programmer
having to care. Providing a tool as powerful as a datatype to handle
character strings without having to care about semantics is the thing to do
(processing 10646 strings is a nightmare because the semantics are complex
and the programmer has no tool to face that in a worldwide environment);
there is no way to bypass that...
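
As an illustration of specs coming from outside the application, here is a
small Python sketch that relies on the POSIX locale mechanism through the
standard locale module. The function name sort_with_locale is hypothetical;
locale.setlocale and locale.strxfrm are the standard calls. Which locale
names are installed depends on the system, so only the "C" locale is assumed
to exist here.

```python
import locale

def sort_with_locale(words, locale_name):
    """Sort words using the collation tables of the named locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        # Raises locale.Error if the system does not provide this locale.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key the locale knows how to order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

# The program contains no ordering rules of its own.
plain = sort_with_locale(["b", "C", "a"], "C")
```

The same call with another locale name (for example a French Canadian one,
if the system has it) would apply that locale's tables with no change to the
program.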

>    The current proposal seems to be asking for a combination of a number
>    of character strings, with a certain semantic relationship between
>    these strings.  Such a datatype can easily be defined based on the
>    concepts already available in LID.  This should be done by the writers
>    of the interface by which the information conforming to this
>    datatype should be passed (this is similar to a language independent
>    definition of the FILE structure from the C standard).
>
>For the reasons mentioned above, WG11 considers it inappropriate to include
>the requested datatype in the document at this moment.  It is also
>recognized that when a complete proposal, based on consensus between
>the various technical expert groups involved, were forwarded to WG11,
>this could be included in a later revision of the standard.

I believe there is no need to describe internals. This was put to me
as an argument earlier and we agreed with it. Why is it now necessary?
The mechanism is well known. The data is provided in LOCALEs, and the
ProcessableString that we need is neither a character string nor a combination
of character strings but a binary string, in the same way as REAL and INTEGER
are internally binary strings... but, like them, a special kind of binary
string, which can be associated with character string data (in object
programming), with the rules externally described in LOCALEs, and with
reverse rules too.
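
A rough Python sketch of such a container follows. Everything in it is an
assumption made for illustration: the class layout, the method name
original, and especially toy_rules, which stands in for rules that would
really come from a LOCALE. The sketch keeps the original text next to the
key instead of implementing true reverse rules. What it shows is that the
program compares opaque binary keys and never looks inside them.

```python
from functools import total_ordering

@total_ordering
class ProcessableString:
    """Opaque binary key associated with the character data it came from."""

    def __init__(self, text, to_key):
        self._text = text
        self._key = to_key(text)  # bytes; the internals are not exposed

    def __eq__(self, other):
        if not isinstance(other, ProcessableString):
            return NotImplemented
        return self._key == other._key

    def __lt__(self, other):
        if not isinstance(other, ProcessableString):
            return NotImplemented
        return self._key < other._key

    def __hash__(self):
        return hash(self._key)

    def original(self):
        """Give back the character data (stored here, not reconstructed)."""
        return self._text

def toy_rules(text):
    # Hypothetical rules: case-folded letters first, exact spelling as a
    # tie-breaker. Assumes the text contains no NUL character.
    return text.casefold().encode("utf-8") + b"\x00" + text.encode("utf-8")

names = ["banana", "Apple", "apple", "Banana"]
wrapped = sorted(ProcessableString(name, toy_rules) for name in names)
result = [item.original() for item in wrapped]
```

Swapping toy_rules for another rule function changes the ordering without
touching the code that sorts, which is the separation argued for above.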

So if WG11 wants to serve humanity it should revise its decision.

Best Regards. Alain LaBont<e'>

Thanks. Again if I can help I'm still open and also open to other positive
contributions.

cc CPWG-MAIL@revcan.rct.ca, SHARE Europe NLA team, SC22WG20@dkuug.dk,
   i18n@dkuug.dk, SC22/WG11@dkuug.dk, gwarren@vnet.ibm.com

