From keld@dkuug.dk Fri Feb 22 21:10:35 1991
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA15064; Fri, 22 Feb 91 21:10:35 +0100
Date: Fri, 22 Feb 91 21:10:35 +0100
From: Keld J|rn Simonsen <keld@dkuug.dk>
Message-Id: <9102222010.AA15064@dkuug.dk>
To: i18n@dkuug.dk
Subject: i18n by Danish Standards
X-Charset: ASCII
X-Char-Esc: 29

Here is an article that I have sent to UniForum for consideration.

keld
----
Internationalization work by Danish Standards (DS).
By Keld Simonsen, Danish UNIX systems User Group

The DS work on characters and character sets originated from the ISO/IEC
JTC1/SC22 special working group on character set usage in programming
languages. In April 1989 the group formulated some requirements on these
issues and put them forward to SC2, which replied that it would only do
a small part of the work. SC2 has since addressed the issue of character
naming in ISO/IEC DIS 10646:1990, where many (and eventually all)
of the characters in the whole world are listed and given a unique
(long) descriptive name.

The DS work has then taken off from the 10646 standard:

1. Short mnemonics have been allocated to about 1300 letters and
   special characters and about 24000 ideographic characters of 10646.
2. About 100 character sets have been tabulated with these mnemonics,
   including almost all of the ISO (ECMA) registry and about 40
   vendor character sets.
3. About 300 names and aliases have been allocated for the character sets.
4. Attributes for each of these characters have been given in
   POSIX.2 localedef notation (alpha, lower, upper, cntrl,
   toupper, tolower etc).
5. A collating sequence in POSIX.2 format has been defined for the
   language Danish. This follows the Danish standard DS 377 (1980).
   The collation sequence is defined on all the 25000 characters, thus
   making it possible to have the same collation sequence defined for
   the 100 character sets tabulated.
6. With the encoding of the 100 character sets it has been possible to
   define a conversion between almost all of these, with a fallback
   representation consisting of an indicator and the character mnemonic.
   Routines have been written in C for this purpose.
7. The conversion routines have been built into sendmail to provide
   multi-character set mail, and this is employed by 10 sites in Denmark.

The work is freely available from dkuug.dk by ftp, ftam and
email (archive@dkuug.dk).
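
The conversion with fallback described in item 6 can be sketched roughly
as follows. This is only an illustration in Python, not the C routines
themselves. The two small tables, the mnemonics in them (apart from
<aa> and <ss>, which are used later in this article) and the choice of
"&" as the indicator are all assumptions made for the example.

```python
# Toy charmaps: byte value -> mnemonic. Hypothetical extracts, loosely
# modelled on an 8-bit Latin set and a 7-bit Danish set.
SET_8BIT = {0x61: "a", 0x73: "s", 0xE5: "aa", 0xE6: "ae", 0xF8: "o/", 0xDF: "ss"}
SET_7BIT = {0x61: "a", 0x73: "s", 0x7D: "aa", 0x7B: "ae", 0x7C: "o/"}

INDICATOR = 0x26  # "&", an assumed indicator byte for the fallback


def convert(data, src, dst):
    """Convert bytes from charset src to charset dst via mnemonics."""
    # Invert the target table: mnemonic -> byte value in the target set.
    reverse = {mnemonic: code for code, mnemonic in dst.items()}
    out = bytearray()
    for byte in data:
        mnemonic = src[byte]  # KeyError if the source table lacks the byte
        if mnemonic in reverse:
            # The target set has this character: emit its code directly.
            out.append(reverse[mnemonic])
        else:
            # Fallback: indicator followed by the mnemonic itself.
            out.append(INDICATOR)
            out.extend(mnemonic.encode("ascii"))
    return bytes(out)
```

With these tables, the three bytes 0xE5 0x73 0xDF come out as "}s&ss":
the first two characters exist in the target set, while <ss> does not
and is therefore written as the indicator plus its mnemonic.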

The work is now documented or employed in:
1. ISO/IEC IS 9945-1:1990 (the POSIX kernel standard) in the informative
   annex E.
2. IEEE POSIX 1003.2 Shell and Utilities draft 10 (and the forthcoming
   draft 11) in the informative annex F.

It has been presented to SC22 special working group on character set
usage, to the European Workshop on Open Systems (EWOS), to the ISO/IEC
JTC1/SC22/WG14 C language group, to RARE and IETF and some vendors,
including X/Open.
Work is ongoing in all of these fora based on this work, although
the work has not yet been endorsed by any of them.

Also the work has been presented to ISO/IEC JTC1/SC22/WG11 (programming
language independent features) as a mechanism to provide general
character set encoding independent strings.

Danish POSIX locale.

For the POSIX.2 standard, a quite general specification of characters
and collating sequence, independent of character set encoding, was
produced.

Together with POSIX charmaps this can be used to sort identically
for a lot of character sets, including 10646 and almost all of
the ECMA registry, and some 30 vendor specific character sets.

There are many levels of complication for collation. For example
the telephone level, with Mc the same as Mac, numbers spelled out,
certain words like "the" ignored or moved to the end etc.
Actually Danish has some rules like that, also in the official
collating standard DS 377 from 1980. Another level is the phonetic level
- soundex, which is a little less complicated. A third level is
transcribed characters, as the librarians use when they see a
Greek alpha and order that as a normal "a".

The level that Danish Standards has decided on for its POSIX.2 locale
is the systems interface level. The collating order should be usable
in POSIX systems tools like ls and sort. A requirement has been that
it is deterministic: if two strings are different, they will also differ
when compared. Another issue has been efficiency. POSIX has provisions
for substituting "Mc" with "Mac", but this is considered too inefficient
and is avoided in the Danish example national locale.

The problem of pronunciation and transliteration has not been
addressed. Instead it has been considered adequate just to look at
the characters themselves - only considering characters at the
systems level - and not sounds. The level provided by the Danish locale
is a service for comparing strings, intended as a replacement
for the standard strcmp() etc routines, just a little more intelligent
and adhering to Danish collating rules.
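
To illustrate what such a replacement looks like from the caller's side,
here is a minimal sketch in Python, whose locale module wraps the C
library's strcoll() and strxfrm(). The locale name da_DK.UTF-8 is an
assumption (it is not taken from this article and may not be installed),
so the sketch falls back to the "C" locale when it is missing.

```python
import locale


def sort_with_locale(words, locale_name="da_DK.UTF-8"):
    """Sort words using the collating rules of the named locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Assumed locale not installed: fall back to plain "C" order.
            locale.setlocale(locale.LC_COLLATE, "C")
        # strxfrm() turns each string into a key that compares the way
        # strcoll() would compare the original strings.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

The calling program does not change when the locale does; only the
collating definition behind LC_COLLATE is swapped.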

We have however put as much intelligence in there as possible at
this level. The two letters <a><a> are sorted as the single letter
<aa> (A WITH RING), but the <aa> single letter is before <a><a>
in homonyms. The 4 level scheme of the Canadian-French sorting is being used,
with the four levels being letter, accent, case and special character.
This was actually also specified in the DS 377. For the sake of harmonization
we decided to use the reverse sorting for the accents as the Canadians
do; the natural choice might have been forward sorting here too,
but as most of these words would be of French origin anyway, we
decided to follow their rules. For <ss> we implemented what we
think is the German rule, as seen in several German dictionaries.
<ss> is ordered as <s><s> but before it in homonyms.
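
The rules above can be sketched as a multi-level sort key. This is a
rough Python illustration, not the POSIX.2 locale source. It covers the
letter, accent and case levels only (the special character level is left
out), and it makes three assumptions of its own: lowercase sorts before
uppercase, the homonym tie-break for <aa> and <ss> is compared last, and
every "aa" is treated as <aa>.

```python
import unicodedata

A_RING = "\u00e5"   # <aa>, A WITH RING
SHARP_S = "\u00df"  # <ss>
# Danish letter order: a..z, then ae, o-slash, a-ring.
ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8" + A_RING
PRIMARY = {letter: rank for rank, letter in enumerate(ALPHABET)}


def danish_key(word):
    """Build a (letters, accents, cases, forms) key for sorting."""
    letters, accents, cases, forms = [], [], [], []
    i = 0
    while i < len(word):
        ch = word[i]
        step, mark, form = 1, "", 1
        if word[i:i + 2].lower() == "aa":
            # <a><a> counts as the letter <aa>, but after it in homonyms.
            units, step, form = [A_RING], 2, 2
        elif ch.lower() == A_RING:
            units = [A_RING]
        elif ch == SHARP_S:
            # <ss> counts as <s><s>, but before it in homonyms.
            units, form = ["s", "s"], 0
        else:
            # Split an accented letter into base letter and accent marks.
            decomposed = unicodedata.normalize("NFD", ch.lower())
            units, mark = [decomposed[0]], decomposed[1:]
        for unit in units:
            # Characters outside the alphabet sort after all letters.
            letters.append(PRIMARY.get(unit, len(ALPHABET) + ord(unit)))
            accents.append(mark)
            cases.append(ch.isupper())
            forms.append(form)
        i += step
    # Accents are compared from the end of the word (the reverse rule).
    return (letters, accents[::-1], cases, forms)
```

Used as sorted(words, key=danish_key), this places a word beginning
with "aa" after one beginning with "z", and a word spelled with the
single letter <aa> just before its homonym spelled with <a><a>.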

For the accents there were some rules indicated in the DS 377 and in the
official Danish orthography dictionary, but they were far from complete.
I think we have about 25 accents ordered.

For the non-latin scripts we decided not to transcribe.
This also allows us to use the native collation order for these
scripts, like alpha, beta, gamma for Greek and a be ve ghe
for Cyrillic. Accented Greek and Cyrillic letters and ligatures
have been put into the right places.

International C locale

In ISO WG14 we plan to provide an international locale for C, including
a good number of accented characters. Thus a quite general collating
specification can be produced. The Canadian Standards Association has
produced a specification for French-Canadian which can also be used
conveniently for US English. We plan on a collating sequence which could
be used for French, English, German, Dutch, Italian, Japanese, Chinese,
Arabic, Hebrew, Russian, Greek and maybe other languages; that is,
languages which just have the normal collation order of their scripts.

Thus a "standard" Latin order is considered, along with a standard
Greek order (alpha, beta, gamma), standard Cyrillic, Arabic, Hebrew,
Bopomofo and Kana. A selected list of special characters and their
ordering is provided too.

Keeping the various scripts apart in the collating specification
has the advantage of being able to cover
quite a lot of cultures without discrimination: the Japanese
have their characters in the "right" order, as do the Russians
and the Arabs, together with most Latin users.
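
The idea of keeping the scripts apart, each in its own native order, can
be sketched as below. This is a small Python illustration covering three
scripts only. The relative order of the scripts (Latin, Greek, Cyrillic)
and the use of the Russian alphabet for Cyrillic are assumptions of the
sketch, not part of the planned specification.

```python
# Native letter orders, built from code points so the tables are exact.
LATIN = "abcdefghijklmnopqrstuvwxyz"
# Greek alpha..omega, skipping the final-sigma variant at 0x3C2.
GREEK = "".join(chr(c) for c in range(0x3B1, 0x3CA) if c != 0x3C2)
# Russian a, be, ve, ghe ... ya; the letter io (0x451) is coded out of
# sequence and belongs right after ie in the native order.
_cyrillic = [chr(c) for c in range(0x430, 0x450)]
_cyrillic.insert(6, chr(0x451))
CYRILLIC = "".join(_cyrillic)

SCRIPTS = [LATIN, GREEK, CYRILLIC]  # assumed order of the scripts

RANK = {
    letter: (script_no, position)
    for script_no, alphabet in enumerate(SCRIPTS)
    for position, letter in enumerate(alphabet)
}


def script_key(word):
    """Key that sorts by script first, then by native letter order."""
    # Characters not in any table sort after all listed scripts.
    return [RANK.get(ch, (len(SCRIPTS), ord(ch))) for ch in word.lower()]
```

Within each script the table decides the order, so a letter whose code
happens to lie outside the main run of its alphabet still sorts in its
native place.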