From erik@sran8.sra.co.jp Tue Feb 19 06:54:44 1991
Received: from mcsun.EU.net by dkuug.dk via EUnet with SMTP (5.64+/8+bit/IDA-1.2.8)
	id AA24326; Tue, 19 Feb 91 06:54:44 +0100
Received: from [133.137.4.3] by mcsun.EU.net with SMTP;
	id AA01860 (5.65a/CWI-2.74); Tue, 19 Feb 91 06:57:39 +0100
Received: from srava.sra.co.jp by srawgw.sra.co.jp (5.64WH/1.4)
	id AA16640; Tue, 19 Feb 91 14:59:03 +0900
Received: from sran8.sra.co.jp by srava.sra.co.jp (5.64b/6.4J.6-BJW)
	id AA15768; Tue, 19 Feb 91 14:57:03 +0900
Received: from localhost by sran8.sra.co.jp (4.0/6.4J.6-SJ)
	id AA10461; Tue, 19 Feb 91 14:55:38 JST
Return-Path: <erik@sran8.sra.co.jp>
Message-Id: <9102190555.AA10461@sran8.sra.co.jp>
Reply-To: erik@sra.co.jp
From: Erik M. van der Poel <erik@sra.co.jp>
To: i18n@dkuug.dk
Subject: Re: (i18n 74) Re: paper presented to WG11 (#20)
Date: Tue, 19 Feb 91 14:55:36 +0900
Sender: erik@sran8.sra.co.jp
X-Charset: ASCII
X-Char-Esc: 29

> 	char *str = "s\<o/>ndag";
>  [...]
> 	3. Compile the \< token into an otherwise unused character
> 	   code and use that at run-time to trigger "on the fly"
> 	   substitution in all STANDARD routines operating on strings,
> 	   e.g. strcpy and printf.
> 
> Kim F. Storm

We should try to resist the temptation of attaching special meanings
to "otherwise unused" codes. One version of the cpp program reserved
the code 0xfe for its own internal use, and it dumped core when run
on source containing Japanese characters. (Why have *another* special
code? Why not just use `\'?) [Of course, if the combination `\<'
causes a core dump, we're still up the creek...]
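
To illustrate why an "unused" code is rarely unused, here is a minimal
sketch. It assumes the Japanese text is in EUC-JP (an assumption; the
encoding is not named above), where both bytes of a two-byte character
lie in the range 0xA1..0xFE, so 0xfe is ordinary data. MARKER and
count_marker_bytes are hypothetical names, not taken from any real cpp.

```c
#include <stddef.h>

/* Hypothetical private marker, standing in for an "otherwise unused" code. */
#define MARKER 0xFE

/*
 * Counts the bytes in s[0..len-1] that a tool reserving MARKER for
 * itself would misread as its own marker rather than as text.
 */
size_t count_marker_bytes(const unsigned char *s, size_t len)
{
	size_t i, hits = 0;

	for (i = 0; i < len; i++)
		if (s[i] == MARKER)
			hits++;
	return hits;
}
```

Under the EUC-JP assumption, any character whose first or second byte
is 0xFE produces a hit, while plain ASCII text never does. That is why
such a reservation can go unnoticed until Japanese input arrives.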

It is of course easy to modify e.g. strcpy and printf to handle the
new escape sequences, but programmers frequently need to process
strings in their own code rather than in a "standard" routine. Even
simple operations like counting characters become rather complex when
escape sequences are present. In this sense, it would be easier to
use a fixed-width, stateless code like 10646's 4-byte form. E.g.:

	wchar_t	*ws = L"s\<o/>ndag";
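
To make the counting problem concrete, here is a rough sketch of what
a programmer would have to write for `char' strings that still carry
the escapes at run-time. It assumes an escape is a backslash, then `<',
then a name, then `>', and that it stands for exactly one character.
esc_strlen is a hypothetical helper, not part of any proposal.

```c
#include <stddef.h>

/*
 * Hypothetical helper: counts characters in a char string in which a
 * sequence of the form \<name> stands for ONE character.
 * Assumption: an escape runs from "\<" to the next '>'. An escape
 * with no closing '>' is counted byte by byte, as ordinary text.
 */
size_t esc_strlen(const char *s)
{
	size_t n = 0;

	while (*s != '\0') {
		if (s[0] == '\\' && s[1] == '<') {
			const char *end = s + 2;

			while (*end != '\0' && *end != '>')
				end++;
			if (*end == '>') {
				s = end + 1;	/* skip the whole escape */
				n++;
				continue;
			}
		}
		s++;
		n++;
	}
	return n;
}
```

With the escape kept in the data, strlen("s\\<o/>ndag") reports 10
bytes and only esc_strlen gives the 6 characters the user sees. With a
fixed-width code, the ordinary wcslen already returns 6, with no
scanning for escapes.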

Even when the programmer insists on using `char' instead of `wchar_t',
we need to make sure that strings read in from a file and strings
compiled into a program are encoded consistently. This could be done
by having a compiler that compiles strings into, say, 10646, and
having files in 10646 too. Or we could leave the strings as they are
and store files in a <o/>-like format, but I don't think this is
Keld's intention.

Of course, interpreting the strings at compile-time will hard-code
info from a particular charmap, but then that is one of the reasons
for having a universal coded character set. :-)

Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692