From keld@dkuug.dk Fri Apr 17 20:50:31 1992
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA24045; Fri, 17 Apr 92 20:50:31 +0200
Date: Fri, 17 Apr 92 20:50:31 +0200
From: Keld Jørn Simonsen <keld@dkuug.dk>
Message-Id: <9204171850.AA24045@dkuug.dk>
To: i18n@dkuug.dk
Subject: Johan van Wingen: liaison from SC22 to SC2 on 10646
X-Charset: ASCII
X-Char-Esc: 29

Forwarded for your information  /Keld

This is the definitive text of my Liaison Report as sent to SC22
Secretariat. Now I am off for the Easter holidays. Tuesday I am back.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  INTERNATIONAL ORGANIZATION FOR STANDARDIZATION


                                           ISO/IEC JTC1/SC22 N


                                                       J. W. van WINGEN

                                                             1992-04-06

                                                            VERSION 1.1

  REPORT OF THE LIAISON REPRESENTATIVE OF SC22 TO SC2

  A FINAL CONTRIBUTION TO THE DISCUSSION ON
  DIS 10646.2, MULTIPLE-OCTET CODED CHARACTER SET

 SUMMARY

 In this report some material is presented to enable national members of
 SC22 working groups to form an opinion on the issues arising with the
 standard development undertaken by SC2/WG2. The document in question is
 DIS 10646-1.2 (Universal Multiple-octet coded Character Set, Part 1,
 Basic Multilingual Plane, Second DIS).

 DIS 10646-1.2 is now out for vote; the ballot closes on 1992-05-30.
 Because the result could affect your work considerably, you should
 contact your national JTC1 so that all interests are reflected in the
 vote, not only those of your national SC2.

 In SC22 N 1091 I presented some fragments from SC2/WG2 N 745, a draft
 (the latest available, but not the final one). Attempts at obtaining
 the amended version all failed. Now that the DIS itself has been
 issued, I have checked whether the final text would cause me to change
 my conclusions. It hardly does. Thus my recommendation remains the
 same:

 -----------------------------------------------------------------------
 Your National Body should vote NO, UNLESS Level 2 be removed from the
 DIS 10646-1.2 (Universal Multiple-octet coded Character Set, Part 1).
 -----------------------------------------------------------------------

 This level introduces the use of combining and composite characters.
 This feature makes the whole concept of coded characters ambiguous, and
 makes the number of characters in a string dependent on the chosen
 representation (that is, undefined, because more than one
 representation is permitted in the same context). It poses a threat to
 Application Portability and to Open Systems.

 Some crucial fragments of the 2nd DIS are added in an Annex, in order
 that you can form your opinion yourself. The following pages discuss
 the matter in more detail, but are mostly taken from my previous
 report.

 SOME CONSIDERATIONS ON THE BASIC CONCEPTS OF DIS 10646-1.2

  Goethe, Faust Teil I, 2555-2559

  Mephistopheles.  Das ist noch lange nicht vorüber,
  Ich kenn es wohl, so klingt das ganze Buch;
  Ich habe manche Zeit damit verloren,
  Denn ein vollkommner Widerspruch
  Bleibt gleich geheimnisvoll für Kluge wie für Toren.

  Mephistopheles.  It will go on that way for a long time,
  I know it well, the rest is just like that;
  I have lost a lot of time on it,
  For a perfect contradiction
  Remains just as mysterious to fools as to men of sense.

 There are several passages in the given text that are unclear or even
 contradict each other. Definitions 4.11 and 4.13 are new to SC2 and do
 not occur in any previous standard. They leave much open to
 interpretation.

 In ISO 6937 "diacritical marks" are specified, each coded with a single
 octet. They are not characters, but a kind of prefix operator.
 A graphic character may be coded with a single octet, or with two, the
 first of them being the coding of a diacritical mark, the second that
 of a letter coded with a single octet. There is only one list of
 permitted characters, the "repertoire" of ISO 6937, giving a name to
 each. Arbitrary combinations of a diacritic and a letter are not
 allowed.
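
 The prefix-operator scheme described above can be sketched as a small
 decoder. This is only an illustration: the octet values chosen for the
 marks, the three-entry repertoire and the function name decode are
 assumptions made for the sketch, not quotations from ISO 6937, and
 unmarked octets are assumed to be lower-case ASCII letters.

```python
# Illustrative octet values for two diacritical marks (assumed for this
# sketch; consult ISO 6937 for the normative code table).
GRAVE, DIAERESIS = 0xC1, 0xC8
MARKS = {GRAVE, DIAERESIS}

# A closed repertoire: only the listed (mark, letter) pairs are characters.
REPERTOIRE = {
    (GRAVE, ord("a")): "LATIN SMALL LETTER A WITH GRAVE",
    (DIAERESIS, ord("a")): "LATIN SMALL LETTER A WITH DIAERESIS",
    (DIAERESIS, ord("o")): "LATIN SMALL LETTER O WITH DIAERESIS",
}


def decode(octets):
    """Return the names of the characters coded by an octet string."""
    names = []
    i = 0
    while i < len(octets):
        first = octets[i]
        if first in MARKS:
            # A mark is a prefix: it must be followed by a letter, and the
            # pair must appear in the repertoire.
            if i + 1 >= len(octets):
                raise ValueError("diacritical mark without a following letter")
            pair = (first, octets[i + 1])
            if pair not in REPERTOIRE:
                raise ValueError("combination not in the repertoire")
            names.append(REPERTOIRE[pair])
            i += 2
        else:
            # Assumption: any other octet is a lower-case ASCII letter.
            names.append("LATIN SMALL LETTER " + chr(first).upper())
            i += 1
    return names


print(decode(bytes([GRAVE, ord("a"), ord("b")])))
```

 Every accepted octet sequence maps to exactly one named character per
 one or two octets, and a pair such as grave accent plus "b" is
 rejected rather than silently creating an unnamed character.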

 In the same way a "combining character", as introduced in the Second
 DIS 10646, could be interpreted as a kind of postfix operator.
 Apparently, however, any combination of this operator and any other
 graphic character is allowed, even that of a Greek letter and an Arabic
 diacritic. It is not clear what the name of the resulting character
 would be. In all SC2 standards a character is identified by its name.
 This principle is violated in 23.4, 2): "..... the character "à" that
 is represented by 4 octets (in UCS-4) ....." A character "A" may be a
 LATIN CAPITAL LETTER A, or a GREEK CAPITAL LETTER ALPHA, or a CYRILLIC
 CAPITAL LETTER A. Thus a sentence that identifies a character by its
 printed shape has no defined meaning. Different names mean different
 characters, even if their visual representation looks the same.
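
 The point about look-alike letters can be checked against present-day
 character data. The sketch below uses Python's standard unicodedata
 module; the code positions and names come from today's Unicode
 database (where the Greek letter is named ALPHA), which is assumed
 here to agree with the DIS tables for these three letters.

```python
import unicodedata

# Three letters that are usually printed with the same shape.
look_alikes = ["\u0041", "\u0391", "\u0410"]

names = [unicodedata.name(ch) for ch in look_alikes]
for ch, name in zip(look_alikes, names):
    print(f"U+{ord(ch):04X}  {name}")

# Three different names, hence three different coded characters.
all_distinct = len(set(look_alikes)) == 3
```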

 The point is what is meant in definition 4.13 by "A graphic character
 the graphic symbol of which consists of the combination of the graphic
 symbols of another graphic character and those of one or more combining
 characters". Does it mean that the result is one single character and
 counted as such in a string, or that what is counted are the originals
 (def. 4.13)? In other words, does 1+1 make 2, or still 1? Or can 1=2
 under some conditions? In character and text processing one wants to
 know whether two characters are identical. Finding that out is not made
 easier if the same character may be either simple or composite in the
 same context.
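
 The counting and identity questions can be made concrete with a short
 sketch. It assumes a language whose strings are sequences of coded
 elements (Python is used here) and present-day code positions for the
 letter and the combining accent; neither is taken from the DIS.

```python
simple = "\u00e0"       # LATIN SMALL LETTER A WITH GRAVE, one element
composite = "a\u0300"   # LATIN SMALL LETTER A + COMBINING GRAVE ACCENT

# "Does 1+1 make 2, or still 1?": the count depends on the representation.
counts = (len(simple), len(composite))

# A plain element-by-element comparison does not treat the two as identical.
identical = simple == composite

print(counts, identical)
```

 Both strings would normally be displayed as the same letter, yet a
 program that counts or compares coded elements gets different answers
 for them.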

 If "combining character" does not mean a character at all, but a "mark"
 in the sense of ISO 6937, some characters are obviously coded with two
 units (2 or 4 octets each), and others with one. This implies the
 mixed coding method that has always been rejected by the programming
 languages community.

 But still worse, a given character may be coded in more than one way,
 as is explained in clause 23. We look again at 23.4, 2): "..... the
 character "à" that is represented by 4 octets (in UCS-4) can also be
 represented by the 4 octets of LATIN SMALL LETTER A followed by the 4
 octets of COMBINING GRAVE ACCENT....".

 Thus a LATIN SMALL LETTER A WITH GRAVE can be coded in UCS-4 with 4
 octets, using those specified in Table 2 of the 2nd DIS, or with 8,
 the diacritic being separated from the letter. Both could even occur in
 the same string. In this way the principle of unique representation is
 discarded. Yet exactly this is stated in definition 4.9 as being an
 essential property of a CODED CHARACTER SET.
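
 The 4-octet and 8-octet codings can be written out explicitly. In this
 sketch Python's "utf-32-be" codec stands in for the four-octet UCS-4
 form, and the code positions are those of present-day ISO/IEC 10646;
 both are assumptions, since Table 2 of the DIS is not reproduced here.

```python
# LATIN SMALL LETTER A WITH GRAVE as one four-octet unit.
precomposed = "\u00e0".encode("utf-32-be")

# LATIN SMALL LETTER A followed by COMBINING GRAVE ACCENT: two units.
separated = "a\u0300".encode("utf-32-be")

print(len(precomposed), precomposed.hex())
print(len(separated), separated.hex())
```

 The two octet strings differ in length and content although they are
 meant to denote the same letter, which is the loss of unique
 representation discussed above.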

 It is not clear whether 4 octets representing a character, followed by
 4 other octets representing another character, establish a new name. If
 so, its name cannot be derived from the rules for constructing names.

 Let us look at definition 4.13, COMPOSITE CHARACTER, again. What makes
 a character a candidate for also being a composite character? Is LATIN
 SMALL LETTER A WITH DIAERESIS a composite character? It may be argued
 that the diaeresis is a detachable part of that letter. But for people
 in Sweden it is simply a letter like all the others in the alphabet,
 just like the i and the j, where the dot belongs to the letter proper.
 That the name includes WITH DIAERESIS does not mean anything. The
 LATIN SMALL LETTER L WITH STROKE has nothing to detach; a STROKE is
 not a diacritic. Thus we cannot determine from its name whether a
 character is composite or not. Splitting a letter into parts is an
 arbitrary operation; it is in fact "decomposition". With this method we
 could specify a "b" as a SMALL LETTER O WITH HIGH BAR LEFT, a "d" as
 the same with the bar RIGHT, a "q" as a SMALL LETTER O WITH LOW BAR
 RIGHT, etc. Then someone might call our b, d, p, q "precomposed".
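
 That decomposability is a matter of tabulated convention rather than
 something readable from the name can be seen in present-day character
 data. The sketch queries Python's unicodedata module; the helper name
 is_decomposable is hypothetical, and the data are today's Unicode
 tables, not the 1992 DIS.

```python
import unicodedata


def is_decomposable(ch):
    """True if the character database lists a decomposition for ch."""
    return unicodedata.decomposition(ch) != ""


samples = ["\u00e4", "\u0142", "i", "b"]  # a-diaeresis, l-stroke, i, b
for ch in samples:
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"{unicodedata.name(ch):40} {mapping}")
```

 The letter with DIAERESIS is listed with a decomposition, while the
 letter WITH STROKE, the dotted i and the b are not; the distinction
 lives in a table, not in the names.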

 Level 2 makes it possible to create files that cannot be processed by
 many devices, because these devices are not based on confused
 concepts. If a device can only interpret a single coding unit as a
 single character, composite characters cannot be processed. No
 fallback mechanism is specified in the DIS to say what to do with a
 base character and the following combining character if they can only
 be displayed as separate characters on the receiving device. This
 implies that the user has to buy more expensive hardware to replace
 the simple equipment that cannot handle the new codes. Thus these
 files are not portable, and they are a threat to open systems because
 they are incompatible with previous coding practice.
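
 What a fallback for a one-unit-per-character device might look like is
 sketched below. Nothing of the kind is specified in the DIS: the
 function names has_combining and fallback are hypothetical, and the
 composition step borrows the NFC normalization of present-day Unicode
 as a stand-in.

```python
import unicodedata


def has_combining(text):
    """True if text contains a character of a combining-mark category."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


def fallback(text):
    """Fold marks into single coded characters where a table entry exists.

    Returns the converted text and a flag telling whether a device that
    treats every coding unit as one character could now display it.
    """
    composed = unicodedata.normalize("NFC", text)
    return composed, not has_combining(composed)


# Latin a + combining grave: a single coded character exists for it.
print(fallback("a\u0300"))
# Greek alpha + Arabic shadda: nothing to fold the pair into.
print(fallback("\u03b1\u0651"))
```

 The first string can be rescued for a simple device; the second cannot,
 which is the situation the receiving device is left alone with.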

 If, on the contrary, it were made possible to display two separate
 characters as one single graphic symbol, or as two, simply as the
 result of a display option selected by the user, most of the
 definitional problems would disappear. A combination of characters may
 represent a typographical unit or a linguistic object, but only if the
 user wants to see it that way, in his application. Devices that are
 able to do that already exist, like many phototypesetting displays and
 the Digital Equipment VT340 terminal.

 CONCLUSION

 Characters are abstract things, independent of any visual
 appearance. Speaking about "interaction of combining characters"
 confuses the concepts of character and glyph. Introducing the concept
 of "composite character" makes it unclear what a character really is.
 Some people may say that atoms and molecules are the same thing, being
 the smallest particles of which a given matter consists. But one cannot
 do any serious chemistry without distinguishing the two concepts.

 Despite minor defects, like the badly designed allocation of
 characters to code positions, DIS 10646-1.2 presents a sound basis for
 supporting a datatype "character" in programming languages, databases
 and network standards, but only if it is restricted to Level 1.
 Inclusion of Level 2 will be a disaster for all existing data
 processing systems that are confronted with it, and will pose
 insuperable problems to new applications that attempt to implement it
 in a way consistent with other standards. My urgent advice to all SC22
 working group members and national delegates is to exert maximum
 pressure on their national bodies to vote NO on the Second DIS 10646
 UNLESS Level 2 and all related material are removed from it.