From keld@login.dkuug.dk Sun Dec 20 17:38:08 1992
Received: from login.dkuug.dk by dkuug.dk with SMTP id AA21438
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Sun, 20 Dec 1992 16:37:30 +0100
Received: by login.dkuug.dk (4.1/SMI-4.0)
	id AA25436; Sun, 20 Dec 92 16:38:08 +0100
Date: Sun, 20 Dec 92 16:38:08 +0100
From: keld@login.dkuug.dk (Keld Jørn Simonsen)
Message-Id: <9212201538.AA25436@login.dkuug.dk>
To: i18n@dkuug.dk
Subject: plan 9 and 10646
X-Charset: ASCII
X-Char-Esc: 29

Here is some illustration on how to use 10646, picked from news.
/keld

In article <1gtrpdINN6c4@corax.udac.uu.se>, andersa@Riga.DoCS.UU.SE (Anders Andersson) writes:
> [note Followup-To: comp.std.internat]
> 
> In article <1gt5a2EINNin3@uni-erlangen.de>, unrza3@cd4680fs.rrze.uni-erlangen.de (Markus Kuhn) writes:
> > It should also be noted, that at least one existing OS (Windows NT)
> > uses a 2 byte encoding both internally (e.g. in filenames in Fnodes
> > on the disc) as well as in text files. Text files always begin with
>                           ^^
> > FEFF as a magic code for ISO 10646 textes. This code also indicates,
> > whether it is a littleendian file.
> 
> Is this magic code visible to the user without any special tricks,
> or is it filtered away by the operating system when the file is
> opened for reading?  Suppose I obtain a file, that is labeled as
> containing IS 10646 text, via FTP from a server running Windows NT,
> to a client running a different system--will I then get this 0xFEFF
> magic code (which is meaningless on my system) too, or will I get a
> 'clean' IS 10646 text?
> 
> I remember seeing text files containing an explicit ^Z (0x1A) at
> the end, due to their origin on some home computer where ^Z was the
> ordinary EOF marker, even though I was sitting on a system with
> perfectly functional EOF pointers in the file descriptor blocks...
> 
> I hope the above isn't yet another version of that problem (non-
> standard tags or markers floating around with standards-compliant
> data on systems not understanding them)?
> 
> Alternatively, does this magic code have any chance of becoming
> a standard itself?
> --
> Anders Andersson, Dept. of Computer Systems, Uppsala University


	This is a quite complicated set of questions that strikes at the
heart of how to handle 10646 text streams and even how to migrate
to where you can handle them.

	firstly, we can answer what FEFF is. it is not a character as such
(in fact, it and FFFE are defined as never being characters). the meaning,
defined in 10646, is that the following byte stream *should* be in MSB-first
order (FFFE indicating LSB first). note that this is informative and
not normative.
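	a sketch in C of what such detection looks like; the function name
and the UNKNOWN fallback are mine, illustrative rather than from any
standard:

```c
#include <assert.h>

enum byteorder { MSB_FIRST, LSB_FIRST, UNKNOWN };

/* infer byte order from the first two bytes of a 10646 stream:
 * FE FF read in order means the writer was MSB first; FF FE means
 * it was LSB first; anything else tells us nothing. */
enum byteorder
guess_order(const unsigned char *buf, long n)
{
	if (n < 2)
		return UNKNOWN;
	if (buf[0] == 0xFE && buf[1] == 0xFF)
		return MSB_FIRST;
	if (buf[0] == 0xFF && buf[1] == 0xFE)
		return LSB_FIRST;
	return UNKNOWN;		/* no mark; the stream is ambiguous */
}
```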

	how would you use such a thing? good question. what i tell you now
is not in any standard. there is a convention, proposed by the unicoders,
that text streams have a FEFF as their first 16 bits; it is not part of
the text stream proper, and serves as a byte-order indicator. (it has never
been clear to me how FEFFs after the first 16 bits are handled.) so a
program like cat would strip the FEFFs from all its inputs, swab'ing
those inputs that began with FFFE, and emit one FEFF before the catenation
of the (processed) inputs.
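	a sketch in C of the per-input step such a cat would need, assuming
in-memory buffers of 16-bit units and the convention just described (the
function name is mine):

```c
#include <assert.h>

/* strip a leading byte-order mark and, if the mark arrived as FFFE
 * (i.e. the input was written in the opposite byte order), swab the
 * remaining 16-bit units in place. returns the new unit count. */
long
normalize(unsigned short *buf, long n)
{
	long i = 0, j = 0;
	int swab = 0;

	if (n > 0 && buf[0] == 0xFFFE)
		swab = 1;
	if (n > 0 && (buf[0] == 0xFEFF || buf[0] == 0xFFFE))
		i = 1;		/* drop the mark itself */
	for (; i < n; i++) {
		unsigned short c = buf[i];
		if (swab)
			c = (unsigned short)((c >> 8) | (c << 8));
		buf[j++] = c;
	}
	return j;
}
```

a cat built on this would run it over each input and emit a single FEFF,
in its own byte order, ahead of the catenation.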

	can we answer the question yet of what happens to an ftp'ed file?
not yet. before we can do that, we need to know a little about your system.
one way of classifying systems is whether or not files are typed.

	in unix (and plan 9), files are not typed from the system
point of view. there are certainly files which are typed at the application
level: archives, executables. these files typically need to be massaged
by specific sets of utilities; for example, the tools for manipulating
archives are ar, ranlib, and ld. however, it is legal to cat archives,
to copy them with cp, and so on, and we know this in advance, because
the system just views these files as byte streams.

	in such systems, how do you handle 10646 text? if you insist upon
the FEFF header, then ALL the utilities handling text have to acquire
code to handle the header (and the byte-swabbing that seeing an FFFE implies).
this also means a new cat (otherwise cat'ing together two binary files
might inadvertently cause one to be swabbed). it means that all tools
that might handle text have to know when their input is text so that they can
handle it. of course, some folks say that all the files on their system
will have the same byte order and thus the FEFF is not necessary. this
is a plausible position but highly restrictive; it fails utterly in the
presence of networked filesystems.

	the other kind of systems support explicit typing of files.
this would allow you to designate a file as being 10646-LSB, S-JIS,
8859-1, or whatever. such systems will find migration easy but of course,
typed files have problems too: what is the type of a pipeline?
what is the type of the output of cat (when its inputs have different types)?
of course, such systems are often small, closed universes (like the world
according to Word) with carefully planned allowable user actions.
such systems are limited but can be really smooth for the user.

	there is also the problem of migration; how the hell do you
migrate to this new scheme? it amounts to being able to guess which
files on your system are text and converting them (to 16-bit chars or whatever).
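	the guessing itself can only be heuristic. a crude sketch in C,
assuming the pre-migration files are either ASCII text or binary (a real
migration tool would also have to recognize latin-1, S-JIS, and so on):

```c
#include <assert.h>

/* a buffer is "probably ASCII text" if every byte is printable
 * ASCII or ordinary whitespace; anything else is taken as binary. */
int
looks_like_ascii_text(const unsigned char *p, long n)
{
	long i;

	for (i = 0; i < n; i++) {
		if (p[i] >= 0x20 && p[i] < 0x7F)
			continue;
		if (p[i] == '\n' || p[i] == '\t' || p[i] == '\r')
			continue;
		return 0;
	}
	return 1;
}
```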

	the next system issue is deciding on the representation of
text streams throughout the system. for example, as the body of a file?
as an argument to a system call? as an argument to a library call?
as text for display by the user's display? these are not necessarily
the same. for example, the system might allow either byte sex as the
contents of a file but insist on one byte order (and thus no FEFF)
for library and system calls and display (this example has difficulties
if your display is actually networked to another architecture).
by and large, this becomes a mess of interfaces each trying to guess
how other parts need their text streams.

	for mostly these reasons, Plan 9 chose a byte-stream encoding
(initially UTF-1 and then UTF-2) and applied it uniformly according
to a single rule: all byte streams interpreted as characters shall
be interpreted as a sequence of 10646 characters encoded as UTF-2.
this applies everywhere: it applies to the kernel and file server,
it applies to the window system and the user's display, it applies
to names in archives and tar files. and best of all, the existing
system and its text is, because we were an ascii site, already
correctly encoded. (actually, we were a Latin-1 system, but we were
willing to make users convert latin-1 text to the new format.)
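	the ascii compatibility can be seen directly in the encoding itself.
UTF-2 is the encoding that was later standardized as UTF-8; here is a
minimal encoder for characters up to FFFF (the function name is mine,
not plan 9's library):

```c
#include <assert.h>

/* encode one 10646 character (up to 0xFFFF) as a UTF-2/UTF-8 byte
 * sequence; returns the number of bytes written (1 to 3). characters
 * below 0x80 encode as themselves, which is why an ascii site's
 * existing files were already correctly encoded. */
int
utf2encode(unsigned char *s, unsigned long c)
{
	if (c < 0x80) {			/* 0xxxxxxx */
		s[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {		/* 110xxxxx 10xxxxxx */
		s[0] = (unsigned char)(0xC0 | (c >> 6));
		s[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	/* 1110xxxx 10xxxxxx 10xxxxxx */
	s[0] = (unsigned char)(0xE0 | (c >> 12));
	s[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	s[2] = (unsigned char)(0x80 | (c & 0x3F));
	return 3;
}
```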

	normally, such a solution
requires that everything entering/leaving the plan 9 universe be converted.
however, as the encoding we use is backward compatible with ASCII,
no conversion need be done for the only important case (text files on
networked filesystems). it also has the advantage that all programs
can display text uniformly; users don't have to write S-JIS editors
because the regular editor (sam or ed) edits kana/kanji just fine.
all the conversion effort can be, and is, confined to one place
(a program called tcs [translate character sets]). the hope is
that in most cases, this conversion can happen automatically
(which is how this stream arose originally; the case of mail
and news should be easy to make happen).
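	tcs itself i cannot reproduce here, but its simplest direction is
easy to sketch in C: latin-1 in, the byte-stream encoding out, since
every latin-1 byte value is the identical 10646 character (this is a
hypothetical filter, not tcs's actual source):

```c
#include <assert.h>
#include <stdio.h>

/* tcs-like filter: latin-1 bytes in, the plan 9 byte-stream encoding
 * out. bytes below 0x80 pass through unchanged; bytes 0x80-0xFF
 * become two-byte sequences. */
void
latin1_to_utf(FILE *in, FILE *out)
{
	int c;

	while ((c = getc(in)) != EOF) {
		if (c < 0x80)
			putc(c, out);
		else {
			putc(0xC0 | (c >> 6), out);
			putc(0x80 | (c & 0x3F), out);
		}
	}
}
```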

	to finally come back to the original question, one would presume
that ftp simply transfers files without diddling them and as such,
if the original had an FEFF, then the result would as well. a more
aggressive ftp might convert to local format, inserting the FEFF as necessary,
but this would require another mode (don't want compress'ed files
swabbed, do we?) for transmission.

	finally, you must understand that 10646 doesn't mandate
solutions to any of these issues. it has accomplished an admirable job
in that we can now unambiguously refer to explicit characters.
however, as i hope i have shown above, there is much more to the job
of converting and migrating to a ``10646 system''. i believe plan 9
was the first such system, mainly because we had the will and the source
(and rather less of it than most systems). there are still lingering
problems, mainly talking to other systems (for example, most of
the printers we use are postscript printers driven from unix machines;
it has been a long and tedious process to get them to understand 10646
characters), but on the whole, within Plan 9, it just works.

	i believe these system (design and migration) issues have been
essentially ignored in all the work and fuss on unicode/10646.
i know that deep within unicode and in places like X/Open, there are
efforts to develop support libraries for wide characters but this simply
ignores the system issues.


			andrew hume
