From keld@dkuug.dk Sat Nov  9 16:50:41 1991
Received: by dkuug.dk (5.64+/8+bit/IDA-1.2.8)
	id AA10498; Sat, 9 Nov 91 16:50:41 +0100
Date: Sat, 9 Nov 91 16:50:41 +0100
From: Keld J|rn Simonsen <keld@dkuug.dk>
Message-Id: <9111091550.AA10498@dkuug.dk>
To: i18n@dkuug.dk
Subject: RIN working guidelines for creating locales
X-Charset: ASCII
X-Char-Esc: 29

Title:	Guideline for producing a national POSIX locale
Source:	Danish Standards Association
Date:	1991-11-09
Status:	Working guideline; to be revised at next RIN meeting

The POSIX.2 locales and charmaps mechanisms
===========================================

The POSIX.2 standard (ISO/IEC 9945-2, currently 2nd CD 1991)
defines general mechanisms for handling arbitrary character sets
and for defining cultural items such as date and time formats,
number and currency formats, collating sequences, messages and
character classification. Such a specification for a language,
country or other culture is called a POSIX national locale.

The mechanisms are specified in POSIX.2 chapters 2.4 and 2.5, and they
are employed in the informative annex F, which shows how
they can be used for Denmark. This guideline
draws on the experience of producing the example Danish locale
and the Japanese locale.

POSIX.2 also provides mechanisms to specify the cultural items
in a character set independent way. This is done
by giving symbolic names to characters and collating-items, 
and then using these symbolic names in the locale definition,
which is then used together with various "charmaps" defining the binding
of the symbolic character names to the codes of a character set.
The source of the locale is compiled with the relevant charmap
into a binary locale which is then ready for use by applications etc.
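
The binding described above can be sketched in a few lines of Python.
This is only a toy model under stated assumptions, not the POSIX.2 file
formats: SYMBOLS, COLLATING_ORDER and the three functions are
hypothetical names, and Python's "latin-1" and "cp865" codecs stand in
for two charmaps. The collating order is written once, in symbolic
names, and is then bound to each character set in turn.

```python
# Hypothetical mini-model of the charmap mechanism; not the POSIX file formats.
# Symbolic names are written as in the locale source of this document.
SYMBOLS = {
    "<a>": "a",
    "<z>": "z",
    "<ae>": "\u00e6",    # LATIN SMALL LETTER AE
    "<o//>": "\u00f8",   # LATIN SMALL LETTER O WITH STROKE
    "<aa>": "\u00e5",    # LATIN SMALL LETTER A WITH RING ABOVE
}

# Character set independent part: the collating order, in symbolic names only.
COLLATING_ORDER = ["<a>", "<z>", "<ae>", "<o//>", "<aa>"]


def make_charmap(codec):
    """Stand-in for a charmap file: symbolic name -> code value in one charset."""
    return {name: ch.encode(codec)[0] for name, ch in SYMBOLS.items()}


def compile_collation(order, charmap):
    """'Compile' the locale source with one charmap: code value -> weight."""
    return {charmap[name]: weight for weight, name in enumerate(order)}


def sort_encoded(words, codec):
    """Encode the words in one charset, sort the byte strings, decode again."""
    table = compile_collation(COLLATING_ORDER, make_charmap(codec))
    encoded = [w.encode(codec) for w in words]
    ordered = sorted(encoded, key=lambda bs: [table[b] for b in bs])
    return [bs.decode(codec) for bs in ordered]


WORDS = ["\u00e5z", "za", "\u00f8z", "az", "\u00e6z"]
# The same symbolic order yields the same sequence in both character sets.
print(sort_encoded(WORDS, "latin-1") == sort_encoded(WORDS, "cp865"))
```

The code values of <ae>, <o//> and <aa> differ between the two
character sets, but the resulting sequence does not, because only the
charmap part depends on the character set.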

The Danish locale and charmaps
==============================

Work on the Danish locale produced a quite elaborate locale, defined
for about 120 character sets in all: parts of ISO/IEC 10646 (currently
about to be published as 2nd DIS 1991), almost all of
the ISO 2375 registry (done by ECMA), and some 60 vendor
specific character sets.
The locale is available electronically together with the 120 charmaps.

Thus with just one specification of a national locale, uniform
collating is defined for many character sets: the characters will
always come in the same sequence regardless of which character set
is employed. Likewise the date format and the
other cultural items need to be defined only once, and that
specification is then valid for many character sets.

WG15RIN thus recommends that only character set independent
locales be specified, and that the symbolic names of POSIX.2 annex F
be used. This removes the pain of producing a locale for every
character set supported, the pain of producing another set of charmaps,
and creates consistency with other locales.

Specifying cultural data
========================

In creating a cultural locale, many things must be considered.
Some data are easier to determine than others. For instance
the character classification section of the locale is normally
straightforward; an "A" is considered a letter in almost all languages
and is mapped to an "a" when the lowercase letter is wanted.
Normally the LC_CTYPE definition in POSIX.2 Annex F can be used
without change. The data in LC_NUMERIC, LC_MONETARY, LC_TIME and
LC_MESSAGES is normally not that difficult to determine.
In producing the Danish locale we had some slight problems with the
date format, including time zone names, which were not well defined.
We consulted as many official sources
as possible, including orthography definitions and numeric rendering
standards. One thing we changed late in the process was to write
day names with an initial small letter - which was in accordance with
the Danish orthography dictionary.

Collating specification considerations
======================================

The Danish collating sequence was harder to define.
Collation can be specified at many levels of complication. One is
the telephone directory level, with Mc the same as Mac, numbers
spelled out, certain words like "the" ignored or moved to the end, etc.
Danish actually has some rules like that, also in the official
collating standard DS 377 from 1980. Another level is the phonetic level
(soundex), which is a little less complicated. A third level is
transcribed characters, as when librarians see a
Greek alpha and order it as a normal "a".

The level that Danish Standards has decided on for its POSIX.2 locale
is the systems interface level. The collating order should be usable
in POSIX systems tools like ls and sort. One requirement has been that
it is deterministic: if two strings are different they will also differ
when compared. Another issue has been efficiency. POSIX has provisions
for substituting "Mc" with "Mac", but this is considered too inefficient
and is avoided in the Danish example national locale.

The problems of pronunciation and transliteration have not been
addressed. Instead it has been considered adequate just to look at
the characters themselves, considering only characters at the
systems level and not sounds. The Danish locale provides
a service for comparing strings which is intended as a replacement
for the standard strcmp() etc. routines, just a little more intelligent
and adhering to Danish collating rules.

We have however put as much intelligence in there as possible at
this level. The two letters <a><a> are sorted as the single letter
<aa> (A WITH RING), but the single letter <aa> comes before <a><a>
in homonyms. The four-level scheme of Canadian-French sorting is used,
the four levels being letter, accent, case and special character.
This was actually also specified in DS 377. For the sake of harmonization
we decided to use reverse sorting for the accents, as the Canadians
do; the natural choice might have been forward sorting here too,
but as most of these words would be of French origin anyway, we
decided to follow their rules. For <ss> we implemented what we
think is the German rule, as seen in several German dictionaries:
<ss> is ordered as <s><s> but before it in homonyms.
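
The multi-level comparison described above can be illustrated with a
small Python sketch. It is a simplification under stated assumptions,
not the Danish locale itself: the ELEMENTS table, the numeric accent
weights and the function names are hypothetical, only a handful of
characters are covered, and the fourth (special character) level is
left out, with the raw string used as a final tie-breaker so that
different strings never compare equal.

```python
# Hypothetical weight table for a handful of collating elements:
# element -> (letter, accent, case). The numeric weights are illustrative.
NO_ACCENT, ACC1, ACUTE, CIRCUMFLEX = 0, 1, 2, 3
CAPITAL, SMALL = 0, 1                     # capitals before small letters

ELEMENTS = {
    "a": ("a", NO_ACCENT, SMALL),
    "c": ("c", NO_ACCENT, SMALL),
    "e": ("e", NO_ACCENT, SMALL),
    "\u00e9": ("e", ACUTE, SMALL),            # e with acute
    "l": ("l", NO_ACCENT, SMALL),
    "o": ("o", NO_ACCENT, SMALL),
    "\u00f4": ("o", CIRCUMFLEX, SMALL),       # o with circumflex
    "t": ("t", NO_ACCENT, SMALL),
    "T": ("t", NO_ACCENT, CAPITAL),
    # <aa> is a letter of its own; its code point happens to sort after "z".
    "\u00e5": ("\u00e5", NO_ACCENT, SMALL),
    # <a><a> gets the same letter weight as <aa>, but a later accent weight.
    "aa": ("\u00e5", ACC1, SMALL),
}


def elements(word):
    """Split a word into collating elements, preferring two-character ones."""
    i, out = 0, []
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in ELEMENTS:
            out.append(ELEMENTS[pair])
            i += 2
        else:
            out.append(ELEMENTS[word[i]])
            i += 1
    return out


def sort_key(word):
    """Key with letter level first, then accents backward, then case."""
    weights = elements(word)
    letters = [w[0] for w in weights]
    accents = [w[1] for w in weights][::-1]   # accent level compared backward
    cases = [w[2] for w in weights]
    return (letters, accents, cases, word)    # raw word as final tie-breaker


print(sorted(["cot\u00e9", "c\u00f4te", "cote", "c\u00f4t\u00e9"], key=sort_key))
print(sorted(["aal", "\u00e5l", "at"], key=sort_key))
```

With the accent level compared from the end of the word, "côte" comes
before "coté"; with forward comparison the two would change places.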

For the accents there were some rules indicated in DS 377 and in the
official Danish orthography dictionary, but they were far from complete.
Where there was no clear Danish rule, the accent sequence of several
ISO standards was used. About 25 accents have been ordered.

For the non-Latin scripts we decided not to transcribe.
This also allows us to use the native collation order for these
scripts, like alpha, beta, gamma for Greek and a be ve ghe
for Cyrillic. Accented Greek and Cyrillic letters and ligatures
have been put into the right places.

The sequence of the scripts was taken from the ISO 10646 draft.
That should settle the question of which scripts should come
before others. The scripts currently addressed are: Latin, Greek,
Cyrillic, Hebrew, Arabic, Kana and special characters. Ideographic
characters are in the works.

Together with the Danish collating
sequence a more general collating sequence was specified.
This collating sequence could be used as a reference sequence,
as mentioned below, and it should produce an order which is compliant
with at least English, French, German, Italian, Dutch, Portuguese,
Greek, Russian, Hebrew and Arabic.

WG15RIN recommends that similar decisions be taken
when producing a new collating sequence.

The cultural specification by the "replace-after" statement
===========================================================

Actually many of these considerations
are not culturally dependent specifications, but different views on
general collating procedures. WG15RIN is also working on specifications
where the truly culturally dependent collating is separated from general
considerations. This is done by specifying differences from a reference
collating sequence: characters are taken out of the reference
collating sequence and moved to another place.
In the Danish example, specifying differences from a reference locale
reduced the truly Danish cultural specification
from about 70 pages to 3 pages.

The replacing mechanism also has the advantage of being applicable to
several different collating sequences, where different general
collating sequences are employed.

WG15RIN recommends using replace-after techniques when specifying
national collating sequences. When an elaborate reference locale
is available, this can reduce the effort of producing a national
locale to a quite manageable size, at least for languages using
alphabetic scripts.
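
The idea of taking characters out of a reference sequence and moving
them to another place can be sketched as follows. This is one reading
of the description above, not a definition of the replace-after
statement: the function replace_after and the REFERENCE fragment are
hypothetical, and weights and levels are ignored completely.

```python
def replace_after(reference, anchor, moved):
    """Return a new order in which the `moved` items are taken out of
    `reference` and re-inserted, in the given order, right after `anchor`."""
    rest = [item for item in reference if item not in moved]
    pos = rest.index(anchor) + 1
    return rest[:pos] + list(moved) + rest[pos:]


# Hypothetical fragment of a reference order in which <ae>, <o//> and <aa>
# sit next to their base letters.
REFERENCE = ["<a>", "<aa>", "<ae>", "<b>", "<o>", "<o//>", "<p>", "<y>", "<z>"]

# Danish difference: the three letters are moved to the end of the alphabet.
DANISH = replace_after(REFERENCE, "<z>", ["<ae>", "<o//>", "<aa>"])
print(DANISH)
```

Only the differences are stated; everything else in the reference
order is inherited unchanged, which is why the national part can be
so much shorter than the full sequence.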

The WG15RIN locale collection
=============================

WG15RIN is collecting locales and making them available electronically.
This is done to help the availability of POSIX locale specifications.
Rules for submitting locales to WG15RIN are available in the paper
ISO/IEC JTC1/SC22/WG15/RIN N____ .

Danish locale using "replace-after" in LC_COLLATE
=================================================

escape_char	/
comment_char    %

% Danish example national locale for the language Danish
% Source: Danish Standards Association
% Revision 1.8 1991-07-04

LC_CTYPE
copy INTC
END LC_CTYPE

LC_COLLATE
copy INTC
% <CAPITAL> letters before <SMALL> letters
replace-after <CAPITAL>
<CAPITAL>
<BOTH>
<SMALL>
% Greenlandic Kra is sorted as <q>
replace-after <q>
<kk>       <Q>;<SPECIAL>;<SMALL>
replace-after <'y>
% <U:> and <U"> are treated as <Y> in Danish
<U:>       <Y>;<ACC11>;<CAPITAL>
<u:>       <Y>;<ACC11>;<SMALL>
<U">       <Y>;<ACC12>;<CAPITAL>
<u">       <Y>;<ACC12>;<SMALL>
replace-after <z<>
% <AE> is treated as a separate letter in Danish
<AE>       <AE>;<NO-ACCENT>;<CAPITAL>
<ae>       <AE>;<NO-ACCENT>;<SMALL>
<A:>       <AE>;<DIAERESIS>;<CAPITAL>
<a:>       <AE>;<DIAERESIS>;<SMALL>
<A3>       <AE>;<ACC3>;<CAPITAL>
<a3>       <AE>;<ACC3>;<SMALL>
% <O//> is treated as a separate letter in Danish
<O//>      <O//>;<NO-ACCENT>;<CAPITAL>
<o//>      <O//>;<NO-ACCENT>;<SMALL>
<O:>       <O//>;<DIAERESIS>;<CAPITAL>
<o:>       <O//>;<DIAERESIS>;<SMALL>
<O">       <O//>;<DOUBLE-ACUTE>;<CAPITAL>
<o">       <O//>;<DOUBLE-ACUTE>;<SMALL>
% <AA> is treated as a separate letter in Danish
<AA>       <AA>;<NO-ACCENT>;<CAPITAL>
<aa>       <AA>;<NO-ACCENT>;<SMALL>
<A-A>      <AA>;<ACC1>;<CAPITAL>
<A-a>      <AA>;<ACC1>;<BOTH>
<a-a>      <AA>;<ACC1>;<SMALL>
replace-end

END LC_COLLATE

LC_MONETARY

% int_curr_symbol according to ISO 4217
int_curr_symbol         "DKK "
currency_symbol         "kr."
mon_decimal_point       <,>
mon_thousands_sep       <.>
mon_grouping            3;0
positive_sign           ""
negative_sign           <->
int_frac_digits         2
frac_digits             2
p_cs_precedes           1
p_sep_by_space          1
n_cs_precedes           1
n_sep_by_space          1
p_sign_posn             4
n_sign_posn             4

END LC_MONETARY

LC_NUMERIC

decimal_point           <,>
thousands_sep           <.>
grouping                3;0

END LC_NUMERIC

LC_TIME

abday       "s<o//>n";"man";"tir";"ons";"tor";"fre";"l<o//>r"
day         "s<o//>ndag";"mandag";"tirsdag";"onsdag";/
            "torsdag";"fredag";"l<o//>rdag"
abmon       "jan";"feb";"mar";"apr";"maj";"jun";/
            "jul";"aug";"sep";"okt";"nov";"dec"
mon         "januar";"februar";"marts";"april";"maj";"juni";/
            "juli";"august";"september";"oktober";"november";"december"
d_t_fmt     "%a %d %b %Y %T %Z"
d_fmt       "%d %b %Y"
t_fmt       "%T"

% The AM/PM notation is not used in Denmark and thus not allowed.
am_pm       "";""
t_fmt_ampm  ""

END LC_TIME

LC_MESSAGES

% Must be careful to avoid interpreting "nej" (no) as "ja" (yes).

% yesexpr     "^[[:blank:]]*[JjYy][[:alpha:]]*"
% noexpr      "^[[:blank:]]*[Nn][[:alpha:]]*"

yesexpr     "<'//><<(><<(>:blank:<)//><)//>*<<(>JjYy<)//><<(><<(>:alpha:<)//><)//>*"
noexpr      "<'//><<(><<(>:blank:<)//><)//>*<<(>Nn<)//><<(><<(>:alpha:<)//><)//>*"

END LC_MESSAGES
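
The commented yesexpr and noexpr patterns in LC_MESSAGES can be tried
out with a short Python sketch. It is an approximation under stated
assumptions: Python's re module has no POSIX bracket classes, so
[[:blank:]] is transcribed as [ \t] and [[:alpha:]] as [^\W\d_], and
the function classify is a hypothetical helper, not part of any
POSIX interface.

```python
import re

# Transcriptions of the commented patterns; the POSIX classes are replaced
# by rough Python equivalents (an assumption of this sketch).
YESEXPR = re.compile(r"^[ \t]*[JjYy][^\W\d_]*")
NOEXPR = re.compile(r"^[ \t]*[Nn][^\W\d_]*")


def classify(answer):
    """Return 'yes', 'no' or None for a user response."""
    if YESEXPR.match(answer):
        return "yes"
    if NOEXPR.match(answer):
        return "no"
    return None


for answer in ["ja", "nej", "  Yes", "maaske"]:
    print(repr(answer), classify(answer))
```

Because the patterns are anchored at the start of the response, the
"j" inside "nej" cannot make it match yesexpr.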
