From keld@dkuug.dk Sun Feb 24 18:58:03 1991
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA13206; Sun, 24 Feb 91 18:58:03 +0100
Date: Sun, 24 Feb 91 18:58:03 +0100
From: Keld J|rn Simonsen <keld@dkuug.dk>
Message-Id: <9102241758.AA13206@dkuug.dk>
To: i18n@dkuug.dk, iso10646@jhuvm.bitnet
Subject: Re:  (wg14 44) AT&T Bell Labs wishes for shorthand character names
Cc: npn@sirius.att.com, wg14@dkuug.dk
X-Charset: ASCII
X-Char-Esc: 29

Nils-Peter Nelson writes:

> I've been working with Brian Kernighan on an ISO 8859-1 version
> of troff.  Brian has already modified the code to accept the
> 8 bit input, and my group is currently working on the ditroff-
> to-PostScript conversion for the additional characters. 

> As a favor to ASCII people we want to preserve the troff convention
> of providing ASCII digraphs for the new characters; however, we
> now see that the troff conventions differ from the commonly used
> VT200 terminal digraphs commonly used world-wide. As an example,
> the British Pound sign is \(ps in troff, but is typed as
> <COMPOSE> L - on most ISO 8859-1 terminals. My way of handling
> this is to change troff to use the VT200 digraphs in the future.

Well, then which is the most common notation: troff or VT200?
I would guess (at least among troff users) that the troff 
convention is the more used.

> I've spoken to Dennis Ritchie several times because he faces
> similar problems with C-- even if variable names are ASCII he
> wants to be able to handle ISO 8859 strings.

Yes that would be handy. ISO WG14 is looking at the problem.
I am a member of this WG and I have an action item of
providing a quite general international C locale, and also we
are discussing general strings with access to the symbolic character
names. This discussion is also going on in the POSIX i18n forum.

I think WG14 would be happy to invite Dennis Ritchie to participate
in this discussion.

> What would avoid all the sturm-and-drang is if the ISO committees,
> as they agree on things like 8859-1,2,3,... and 10646 would
> provide *as part of the standard* the "shorthand" notation in
> the next "lower" character set. As an example, in 8859-1,
> character 10/03 might be defined by:
> 10/03	L-	POUND SIGN
> (The digraph in this case must be two ISO 646 characters.)
> The standard would also recommend that, where 8 bit keyboards
> were not available, the sequence <COMPOSE> L - would be
> equivalent to 10/03. <COMPOSE> could be undefined-- on keyboards
> it's a key, in troff it would be \(, in C it might be something
> else.
> ISO 10646 would require a tetragraph, but again, it should be one
> recommended by the standards committee.

Yes, this has been proposed within ISO. 

Actually it was SC22 who proposed this to SC2 - to provide unique
naming and short identifiers to all characters provided by SC2.
The SC22 requirements were stated in the paper SC22 N622R.
SC2 responded by assigning unique (long descriptive) names for
characters in the new universal character code ISO (DIS) 10646.
But they did not want to provide short identifiers.

As SC22 needed this for various purposes, like the C and POSIX
standards, a NWI has been proposed and accepted by SC22.
This NWI covers internationalization (i18n) and includes character set
work on identification and conversion within character sets.
The text is not fully clear on the shorthand requirement,
but I think it is sufficiently clear that this work is included.
The NWI is assigned to the new SC22 WG20 on internationalization.
They have not met yet. The convenor is Dick Weaver of IBM, who
is on (at least) the i18n@dkuug.dk list and thus gets these messages.
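
As a rough illustration of the table format proposed in the quoted
message (code position, digraph, name), here is a minimal Python sketch.
Only the POUND SIGN row comes from the message; the table layout, the
function names and the use of ISO 8859-1 decoding are my own assumptions
and not part of any standard.

```python
# Hypothetical shorthand table in the proposed "position, digraph, name" form.
# Only the POUND SIGN row is taken from the message; the layout is assumed.
SHORTHANDS = {
    "L-": ("10/03", "POUND SIGN"),
}


def position_to_code(position):
    """Convert ISO column/row notation such as '10/03' to a code value."""
    column, row = position.split("/")
    return int(column) * 16 + int(row)


def expand(digraph):
    """Return the ISO 8859-1 character that a digraph stands for."""
    position, _name = SHORTHANDS[digraph]
    return bytes([position_to_code(position)]).decode("latin-1")
```

The <COMPOSE> prefix is deliberately left out of the table, since the
proposal leaves it undefined (a key on a keyboard, \( in troff).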

> Since you announced yourself to be the "contact person for
> the general public" I'm asking you to bring this to the
(Keld: "you" here means "Thom Plum".)
> attention of the various committees. If the standardization
> is not offered by ISO, we run the risk of different conventions
> in troff, TeX, C, MS-DOS, etc.

Yes, I share your concerns. And they are shared by the ISO POSIX WG15
RIN rapporteur group on i18n (of which I am a member). We surely
would like to avoid this mess.

I can give you an overview of what work I know is going on in the
field of naming characters.

1. ISO 10646 is defining long descriptive names to be used in all
   ISO SC2 character work. There are rules for the syntax of the names,
   including that the letters in the names must be capital. I have a
   SC2/WG3  paper N65 (April 1989) on these rules, but that might not
   be the most recent.  An example is:

         CYRILLIC CAPITAL LETTER ER

   Some of the namings are available freely and electronically from
   dkuug.dk:i18n/ISO_10646 - this contains also shorthands provided
   by Danish Standards as described below.

2. ISO 6937-2 & ISO 10367 have a short naming of the Latin
   characters and also some special characters. These shorthands
   are four characters long, the first two being capital letters
   and the last two being digits. An example:

          LA12

   (which identifies some Latin A with an accent - I cannot remember which)

   The naming is available freely and electronically in Johan
   van Wingen's work, as noted below.
   
3. SGML - ISO 8879:1986, the Standard Generalized Markup Language,
   is (one of) ISO's answers to troff, TeX etc. There are quite a few
   shorthands there - I think they are mostly made up from upper
   and lower case letters. An example is:

           <Aring>

   which means LATIN CAPITAL LETTER A WITH RING (10646 name).
   I do not know if these specs are available electronically.

4. POSIX has a standard naming for the ASCII characters which
   are used in the POSIX locale. They may differ a little from
   the 10646 names, but not much, and then they are in lowercase.
   An example:
       
           <percent sign>
   
   The naming is available as part of the Danish POSIX locale
   as noted below.

5. Danish Standards (the Danish ISO member body) has produced an
   elaborate "Example Danish National Locale" for POSIX, included
   in the POSIX.2 draft 10 (published a bit later than the rest of
   draft 10) and also in the next draft. I have been very active
   in producing this specification. There are shorthands for a
   considerable part of ISO 10646, covering many alphabetic and
   ideographic characters, some 25000 characters in all (1300 non-
   ideographic). IMHO it is the most elaborate work available today
   on shorthands. Mostly the shorthands are two-character from
   the invariant ISO 646 set (ASCII minus 12 characters), but
   longer names are also permitted and used for ideographic characters.
   An example:

          R= 
   
   for CYRILLIC CAPITAL LETTER ER
   It is freely and electronically available in dkuug.dk:i18n/ISO_10646
   and dkuug.dk:pub/ch.shar* . The work is used as a basis for work on
   POSIX locale specification, for ISO C international locale, for OSI
   work and for other communication work (Internet).

6. OSI ISOCHARSTRING - SC21 decided at their meeting in Berlin in
   October 1990 to make a new ASN.1 string specification, the ISOCHARSTRING.
   There the long descriptive names of ISO 10646 are used, stripping
   spaces in the name and converting all letters except the initial
   of each word into lowercase. An example is:

          CyrillicCapitalLetterEr

7. Johan van Wingen from Nederlands Normalisatie-instituut (the Dutch
   ISO member body) has a convention for character naming, which is
   two-character and drawn from ASCII (I think). It is used in his
   survey of which languages require which characters, and also
   how these characters are collated in each of these languages.
   The papers are available electronically - one source is the 
   iso10646 archive at jhuvm.bitnet.
   
8. Troff conventions: The original Ossanna specifications had
   quite a few shorthands for non-ASCII characters. Some other
   conventions building on the Ossanna specs have been made.
   I was a coauthor of one (together with Ed Keizer and Jaap
   Akkerhuis) which was discussed on the net recently.
   This article is available freely and electronically
   from dkuug.dk. An example is:

         \(*a

   for GREEK SMALL LETTER ALPHA

9. VT200 has the compose character function, which consists of
   a special compose character and then two characters from the
   normal ASCII keyboard. 

10. TeX and other formatting packages.
    I do not know too much about these, but TeX does have shorthands.
    TeX is available freely and electronically from various
    sources (I am not sure where).
    WordPerfect also has a shorthand, but that is just numbers.
    Other word processing packages surely also have their conventions.

11. C - The ISO WG14 C committee is working on an addendum to
    the ISO C standard ISO/IEC 9899:1990 (technically equivalent
    to the ANSI standard). I have an action item on producing a
    proposal for an international C locale, building on the Danish
    POSIX work.

12. Alain LaBonte' of the Canadian Standards Association is working
    on a shorthand, especially for Chinese characters, as far as
    I understand. I have not seen this work, though.

13. X windows is not naming characters, but has something that comes
    close. In the X Input Methods for Japanese, a way of generating
    Kana and Kanji characters from ASCII is provided. You may 
    consider this as shorthands. There may be other Input Methods
    defined for other character sets.
    
    Also, X has written a specification for i18n POSIX locales 
    with the requirement that shorthand names (symbolic character
    names) should be given in a restricted ASCII, namely the 
    invariant part of ISO 646, which contains 12 fewer characters.
  
    The X fonts and names are freely available from MIT, also
    electronically (expo.lcs.mit.edu - as far as I remember).
    It is HUGE.

14. IBM has a naming of letters which is much like the ISO 6937-2
    naming, but is 8 characters. They use it for specification of all
    their character sets. One document where it is used is
    SE09-8002-01 on Natural Language Support.
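
Two of the naming rules in the list above are mechanical enough to
sketch in code: the ISOCHARSTRING derivation in item 6 and the
four-character identifier shape in item 2. The Python below is only an
illustration under my own assumptions; the function names are
hypothetical, and the handling of hyphens or digits inside long names
is a guess, since the rule as described only mentions spaces and
letter case.

```python
import re


def isocharstring_name(long_name):
    """Derive an ISOCHARSTRING-style name from an ISO 10646 long name.

    Spaces are stripped and every letter except the initial of each
    word is lowercased, as described in item 6.
    """
    return "".join(
        word[:1].upper() + word[1:].lower() for word in long_name.split()
    )


def looks_like_6937_id(identifier):
    """Check the item 2 shape: two capital letters, then two digits."""
    return re.fullmatch(r"[A-Z]{2}[0-9]{2}", identifier) is not None
```

For example, isocharstring_name("CYRILLIC CAPITAL LETTER ER") gives
CyrillicCapitalLetterEr, and looks_like_6937_id("LA12") is true. The
second function checks only the shape of an identifier, not whether it
is actually assigned in ISO 6937-2.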

Keld Simonsen
