From erik@naggum.uu.no Thu Apr  4 16:41:24 1991
Received: from ifi.uio.no by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA11580; Thu, 4 Apr 91 16:41:24 +0200
Received: from uunic.uu.no by ifi.uio.no with SMTP 
	id <ABifi03927>; Thu, 4 Apr 1991 16:11:22 +0200
Received: by uunic.uu.no (UU2.3) id <1991-093-50526>
	  for <I18N@dkuug.dk>, <.>; Thu,  4 Apr 1991 16:02:06 +0200
Message-Id: <1991-093-50526@uunic.uu.no>
Date: Thu,  4 Apr 1991 16:02:06 +0200
From: Erik Naggum <erik@naggum.uu.no>
To: I18N@dkuug.dk
In-Reply-To: <9104032245.AA17630@dkuug.dk>
Subject: SGML & Internationalization
X-Charset: ASCII
X-Char-Esc: 29

Welcome to the SGML fold, Dominic!

I have some comments on your interpretation of the internationalization
features of SGML.  SGML may look as if it gives you a lot of
internationalization features, while in reality it only aims to support
transportability of documents.

What you're thinking of (warning: ESP in use) is the SGML declaration,
which is a description of (among other things) the codes employed for
the syntax and document character set(s), _not_ the character sets
themselves.  There are BASESET and DESCSET keywords to be found in this
declaration, and all they say is that we will now base the description
on some known character set, basically identified with ISO 2022
sequences and ISO registration numbers.  What these do is a little
hard to figure out, since the parser may ignore these declarations for
the document character set!  The document is already in the document
character set when the parser gets it...

The purpose is to identify, to the human who moves the file(s), what
character set was used, so that it's possible to figure out what to do
to get it into a form useful for parsing.  Of course, in many cases
this can be mechanized, but the intention is clear:

13.1.1.1 Base Character set

    The public identifier is a human-readable identifier of the
    base character set.

    NOTE - For example, a standard or registered name or number, or
    other designation that will be understood by the expected recipients
    of the document.

    The public identifier should be a formal public identifier with a
    public text class of "CHARSET".

The purpose of the base set is to be one from which you select
characters with the DESCSET phrase.  The DESCSET phrase can remap the
entire character set, or select characters from several base character
sets.  The document uses the described character set.
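
To make the selection step concrete, here is a minimal Python sketch,
assuming each DESCSET entry can be modelled as a triple of (first
described-set code, number of codes, first base-set code, or None for
UNUSED).  The function name build_described_set and the sample entries
are hypothetical, chosen only for illustration; they are not taken from
any real parser or declaration.

```python
# Hypothetical model of a DESCSET: each entry is
# (first described-set code, number of codes, first base-set code or None for UNUSED).
def build_described_set(entries):
    """Return a dict mapping document codes to base-set codes."""
    mapping = {}
    for start, count, base_start in entries:
        if base_start is None:
            continue  # UNUSED: no meaning is assigned to these codes
        for offset in range(count):
            mapping[start + offset] = base_start + offset
    return mapping

# Sample entries (assumed for illustration): codes 0-8 unused,
# codes 9-10 and 32-126 taken unchanged from the base set.
SAMPLE_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (32, 95, 32),
]

described = build_described_set(SAMPLE_DESCSET)
```

The resulting table only records what each document code means in
terms of the base set.  Nothing in it says the parser has to convert
the document; it is a description, which is the point of the paragraph
above.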

While this scheme allows for coded character sets of many kinds, its
purpose is not to facilitate a parser which can know about them all.

Let's say we provide a DESCSET which describes a document character set
equivalent to the ROT13 "encryption" scheme employed on USENET.  The
parser is then expected to operate in "ROT13 native mode", _not_ to do
the mapping from the ROT13'd text into the base character set.
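
A rough Python sketch of what "ROT13 native mode" could look like,
assuming an ASCII-like base set and the same hypothetical triple layout
(first document code, count, first base-set code).  The names
ROT13_DESCSET, native_name and find_start_tag are invented for this
example, and the tag matching is heavily simplified compared with a
real SGML parser.

```python
# Hypothetical DESCSET triples for a ROT13 document character set over an
# ASCII-like base set: (first document code, count, first base-set code).
ROT13_DESCSET = [
    (0, 65, 0),      # codes below 'A' are unchanged
    (65, 13, 78),    # document A-M mean base N-Z
    (78, 13, 65),    # document N-Z mean base A-M
    (91, 6, 91),     # codes between 'Z' and 'a' are unchanged
    (97, 13, 110),   # document a-m mean base n-z
    (110, 13, 97),   # document n-z mean base a-m
    (123, 5, 123),   # codes after 'z' are unchanged
]

def doc_to_base(entries):
    """Expand the triples into a document-code -> base-code table."""
    table = {}
    for start, count, base_start in entries:
        for offset in range(count):
            table[start + offset] = base_start + offset
    return table

DOC_TO_BASE = doc_to_base(ROT13_DESCSET)
BASE_TO_DOC = {base: doc for doc, base in DOC_TO_BASE.items()}

def native_name(text):
    """Express base-set text in the document character set, once, up front."""
    return "".join(chr(BASE_TO_DOC[ord(ch)]) for ch in text)

def find_start_tag(document, element):
    """Search the raw document for a start-tag without decoding the document."""
    return document.find(native_name("<" + element + ">"))

raw = "<gvgyr>Uryyb</gvgyr>"   # the document exactly as stored
position = find_start_tag(raw, "title")
```

Only the short strings the parser itself needs are translated into the
document character set; the document text is never mapped back to the
base set, which is the behaviour described above.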

I'm not 100% satisfied with the above clarification/explanation, so if
anybody has questions, I suggest we air this in comp.text.sgml, and
get back here when we have solved any problems.  To get other people's
opinions in, I ask permission from both you and Keld to quote from your
articles and produce a posting to comp.text.sgml.  (Please reply to me,
not the list.)

--
[Erik Naggum]					     <enag@ifi.uio.no>
Naggum Software, Oslo, Norway			   <erik@naggum.uu.no>