From erik@naggum.uu.no Fri Apr  5 23:37:59 1991
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA01453; Fri, 5 Apr 91 23:37:59 +0200
Received: from uunic.uu.no by ifi.uio.no with SMTP 
	id <ABifi06102>; Fri, 5 Apr 1991 23:37:54 +0200
Received: by uunic.uu.no (UU2.3) id <1991-094-77494>
	  for <domo@tsa.co.uk>, <unicode@sun.com>, <i18n@dkuug.dk>;
	  Fri,  5 Apr 1991 23:31:34 +0200
Message-Id: <1991-094-77494@uunic.uu.no>
Date: Fri,  5 Apr 1991 23:31:34 +0200
From: Erik Naggum <erik@naggum.uu.no>
To: Dominic Dunlop <domo@tsa.co.uk>
Cc: unicode@sun.com, i18n@dkuug.dk
In-Reply-To: <12633.9104042108@tsa.co.uk>
Subject: Re: Sort sequence
X-Charset: ASCII
X-Char-Esc: 29

> Absolutely.  And absolutely not Unicode's problem.  Having recently had
> occasion to get into tagged text as defined by ISO 8879 (SGML), I
> discover that you can do this with tagged text and make it (more or
> less, depending on your preference) independent of character set.
> (Although SGML points to ISO 2022 for shifting from one character set
> to another, which was a problem for Unicode last time I was paying
> attention.) Once you've got tags specifying sort sequence (or anything
> else that takes your fancy), parsing software can detect and act on
> them -- presumably, in the C/POSIX universe, by calling setlocale().

A few notes on SGML, my field of expertise in this matter.

SGML is much more than "tagged text", which is both my and Charles
Goldfarb's major gripe about the popularization that SGML has seen.
SGML defines a language to specify the structure of information as found
in electronic text, among other things.  It also provides us with a way
to describe which character set we're using, and what each of the codes
means.  This is for the benefit of moving documents around, more than for
"understanding" random character sets.

The "shifting" which Dominic alludes to is what is found in the multi-
code concrete syntax, in which you can specify that certain characters
shift you out of the "ordinary" character set (G0 in ISO 2022 parlance)
and that other characters shift you back in.  The sole purpose of this
is to notify the parser that anything found after a SHIFT OUT or LOCKING
SHIFT other than LOCKING SHIFT ZERO is "something else" which the
application is supposed to know how to handle, and the parser shouldn't
look at.  Come an Entity end for the entity in which the shift occurred,
or a SHIFT IN or LOCKING SHIFT ZERO, the parser again looks at the
characters.  In other words, as long as we're not in G0, all characters
are treated as data characters.  (A note should be made here, that if
you switch the character set in G0 to something else via an ISO 2022
escape sequence, you're more likely to fool both yourself and your
parser than to get good results.)
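
A minimal sketch of that behaviour, in Python, may make it concrete.  It
assumes only the single-byte SHIFT OUT (0x0E) and SHIFT IN (0x0F) controls
and ignores the other locking shifts; the function name and the "scan" and
"data" labels are invented for illustration and come from no SGML parser.
Calling it once per entity models the rule that an Entity end cancels the
shift.

```python
SO = 0x0E  # SHIFT OUT: leave G0
SI = 0x0F  # SHIFT IN (LOCKING SHIFT ZERO): return to G0


def split_shifted(entity: bytes):
    """Split one entity into ("scan", bytes) and ("data", bytes) runs.

    "scan" runs are in G0 and would be examined for markup.
    "data" runs follow a SHIFT OUT and are passed through untouched.
    Hypothetical helper: handles SO/SI only, not LS2/LS3 escape sequences.
    """
    runs = []
    shifted = False  # every entity starts in G0
    start = 0
    for i, b in enumerate(entity):
        if not shifted and b == SO:
            if i > start:
                runs.append(("scan", entity[start:i]))
            shifted = True
            start = i + 1
        elif shifted and b == SI:
            if i > start:
                runs.append(("data", entity[start:i]))
            shifted = False
            start = i + 1
    # Entity end: flush whatever is left; the shift does not outlive the entity.
    if start < len(entity):
        runs.append(("data" if shifted else "scan", entity[start:]))
    return runs
```

Given b"<p>\x0e<b>\x0f</p>", the "<b>" between the two controls comes back
as data, not as a tag, which is all the mechanism is for.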

So all SGML does is to allow you to have codes in your SGML document
that could otherwise be interpreted as markup.  The document and the
parser are still very much dependent on character sets, although I
imagine that it would be possible to create an application which
understood more than one character set, i.e. which understood the ISO
2022 escape sequences for a limited set of character sets.  ISO TR 9573
has more on these issues.

I think it's important to stress that SGML doesn't know about character
set issues at all.  It demands, though, that you use ISO 646 for the
SGML declaration (which contains the document character set description
and a pointer to the concrete syntax, among other things).  The reference
concrete syntax has to be employed in the SGML declaration, for ease of
interchange.

As to tags specifying sort sequence, I very much doubt the usefulness of
such.  The tagging one employs for the data content is intended to
state what kind of information we're talking about.  Based on this
information, an application will have to know how to sort things, but
moving such considerations into the tags seems to violate the spirit of
SGML.  Sure, it can be done.  People have also used tags to denote in
which language a text is written, for purposes of elaborate multi-
lingual support, such as different quotes for different languages.  I
have found such attempts artificial so far, but they may have real uses
that I'm not aware of.
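
To illustrate the division of labour, here is a small Python sketch in
which the markup only says what kind of information an element holds, and
the application owns the table that says how each kind is ordered.  The
element names ("surname", "partno") and the rules attached to them are
hypothetical; nothing here is defined by SGML.

```python
# Hypothetical application-side table: element type -> sort key function.
# The document carries only the element type; the ordering lives here.
SORT_RULES = {
    "surname": str.casefold,  # order names without regard to case
    "partno": int,            # order part numbers numerically
}


def sort_elements(element_type, values):
    """Sort the content of elements of one type by the application's rule."""
    key = SORT_RULES[element_type]
    return sorted(values, key=key)


print(sort_elements("surname", ["naggum", "Dunlop", "Goldfarb"]))
print(sort_elements("partno", ["10", "9", "100"]))
```

Changing how surnames sort then means changing the application's table,
not retagging every document.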

Not everything can be done with tagging.

> > One locale per sort rather dilutes the notion of locale, doesn't it?

> Are you suggesting that different keys in a multi-key sort should sort
> in different locales?  Well, why not?  Again, you could tag the stuff to
> indicate that this was what you wanted to happen.  And, I suppose that,
> given fair warning, somebody could cook up an application which acted
> correctly on the information.  Volunteers?

Ahem, it wouldn't be SGML if this was your purpose, although it would
take some of the "look and feel" of SGML for presentational purposes.

I appreciate your eagerness in applying SGML and tagging to almost every
problem around, but believe me, it's much more difficult than you
anticipate, and it won't yield the benefits we're looking for.  Something
else will have to be invented for multi-key sorts in multiple locales to
be represented cleanly.

> While I'm here, a quick locale story: I had occasion recently to visit
> the cockpit of an Airbus A320.  In it are screens showing kilogrammes of
> fuel, airspeed in knots or mach, height in feet and oil usage in quarts.
> (The captain did not know whether these were US or Imperial.)  If we ever
> get around to defining a UNITS locale, I think that cockpit is going to
> give us problems.  No doubt other industries will present other
> idiosyncrasies...

Not to mention some of the things which are embedded in our language.
Time Magazine recently made some effort to "go metric".  I generally
applaud this, but one phrase got a little weird:

	Your kilometrage may vary.

That's almost as bad as the Boston newspaper which shunned all race-
related phrases and wrote about a local company which, after some
trouble, was

	back in the African-American.

My point is thus simply that we not overdo it.  Locales and tagging can
help us a long way, but they can't remove the problems from our lives.

The other danger is that we make something so complex as to introduce a
factor by which to multiply the already mind-bogglingly large number of
states in even a moderately sized program.  Localizing the locale
support is not trivial, either.  More work should be dedicated to the
effect of locales on programming, before we solve problems that don't
exist or make solutions that won't be used.  Sometimes, I'm more
concerned about the programmers and the resulting programs than I am
about the supposed "user needs" that we're addressing.

[Erik Naggum]