From alb@sct.gouv.qc.ca  Thu Mar 12 15:28:58 1998
Message-Id: <3.0.1.32.19980312092547.0078f8ec@entree.sct.gouv.qc.ca>
X-Sender: alabonte@entree.sct.gouv.qc.ca
X-Mailer: Windows Eudora Pro Version 3.0.1 (32) [F]
Date: Thu, 12 Mar 1998 09:25:47 -0500
To: unicode@unicode.org
From: Alain LaBonté  <alb@sct.gouv.qc.ca>
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
In-Reply-To: <9803121037.AA21262@unicode.org>

At 02:37 98-03-12 -0800, Hallvard B Furuseth wrote:
>I wrote:
>
>>> In particular, I wonder about
>>> character ranges: If the user says "[À-Å]" in his 8-bit charset (not
>>> latin-1),
>
>Please note that I used two accented characters.  Let's use the example
>[Ø-ß] instead, it's easier to see that they are non-ASCII.
>
>I'd be delighted if anyone has made a sensible definition of [7-bit ->
>8-bit] ranges too, but I didn't want to complicate the issue too much.
>Maybe that was silly, though -
>
>Jeroen Hellingman wrote:
>
>> This opens up a whole range of challenges. I would say that this will
>> depend on the user's locale. For example, if I am Danish, I would
>> expect all letters from A to A-ring to match, if I say [A-Å],
>> according to the Danish alphabet. In England I might expect to get all
>> letters A, irrespective of any accents on them.
>
>I disagree there.  I would expect a character range like [A-Å] to be the
>characters numbered from char('A') to char('Å') in the charset I am
>using.  (For me, latin-1.  Which is an uninteresting example because the
>useful character codes are the same as in Unicode).
>
>Or maybe non-ASCII character ranges would simply be forbidden.  If so,
>can anything replace them?  They exist because they are useful...
>
>I won't expect programs to give all characters the correct collating
>sequence in my language -- if nothing else, because a program often
>can't know which language it is looking at.  It only knows the charset.
>Sometimes it can ask the user about the language, but not always.
>
>> It would be quite unexpected to match almost all characters
>> if some user enters [A-Z], when the Z happens to
>> come from the compatibility zone at the high end of Unicode.
>> This means you'll have to do some locale-defined normalisation on your
>> data before pattern matching, comparable with sorting and searching
>> operations.
>
>Agreed.
>
>> I wouldn't bother about the original charset, when using
>> Unicode, the user expects Unicode.
>
>The user may not even *know* about Unicode.  If he does and that's "his"
>charset, everything is wonderful.  But I was thinking of the situation
>where the *user* is basically using some 8-bit character set and the
>*program* is using Unicode (and translates input from the user's charset
>to Unicode).  Then we'll either have to dump regexp character ranges, or
>define some way the program can know when the user means a range of his
>native characters, and when he means a range of Unicode characters, or
>define some equally useful alternative to ranges.
>
>-- 
>Hallvard


[Alain] :
For general-purpose applications this kind of practice is a very bad
programming technique, and there is so far no firm international standard
convention for it. (Unfortunately there is a de facto standard, quite
"international" among computer specialists, in some programming languages
to that effect, and it is exactly what Hallvard expects.) However, the
pair of standards projects ISO/IEC 14651 and 14652 (currently under
ISO/IEC FCD ballot) does contain a convention, established in practice,
to ease ellipsis definitions.

14652 describes the form  <character symbol 1>...<character symbol 2>  to
define an ellipsis that depends on the coded character set. (Well, this is
what you call a "regular expression". Thanks for reminding me of this very
ambiguous term, whose meaning I had forgotten: we came across it while
revising the ISO 14652 standard last week and did not know exactly what
it referred to. It seems to be a C-language-specific expression, but I'm
not a C-language specialist.)

14652 also defines (it is Keld Simonsen's proposal, fine with me), for the
needs of ISO/IEC 14651 (International String Ordering [and matching]
Standard), two dots instead of three to define a *code-independent*
ellipsis, using the UCS code as the international ellipsis reference,
regardless of the actual coding used...

i.e. in ISO/IEC 14651:

<U00000001>..<UEFFFFFFF> means the range of all characters from 1 to
xEFFFFFFF in UCS order, regardless of the coded character set under the
hood.
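
To make the two-dot form concrete, here is a minimal Python sketch of how
such an ellipsis could be read as a range of UCS code values. It is only an
illustration under stated assumptions: the helper names
(expand_ucs_ellipsis, in_ellipsis) are hypothetical, and the pattern accepts
only the "<Uhhhh>..<Uhhhh>" shape quoted above, not the full syntax of the
14651/14652 drafts.

```python
import re

# Hypothetical helper for illustration: accepts only the two-dot form
# quoted in this mail, "<Uhhhh>..<Uhhhh>" with 4 to 8 hex digits.
_ELLIPSIS = re.compile(r"<U([0-9A-Fa-f]{4,8})>\.\.<U([0-9A-Fa-f]{4,8})>")


def expand_ucs_ellipsis(spec):
    """Return the inclusive range of UCS code values named by spec."""
    m = _ELLIPSIS.fullmatch(spec)
    if m is None:
        raise ValueError("not a two-dot UCS ellipsis: %r" % (spec,))
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    if first > last:
        raise ValueError("empty range: %r" % (spec,))
    # range() is lazy, so even a very large ellipsis costs no memory.
    return range(first, last + 1)


def in_ellipsis(ch, spec):
    """True if ch lies inside the ellipsis, judged by UCS value only."""
    return ord(ch) in expand_ucs_ellipsis(spec)
```

The membership test looks only at the UCS value of the character, so the
answer is the same whichever 8-bit coded character set the text was
originally entered in, provided it was converted to UCS first.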

I agree, though, that this should *not* be locale-dependent and that it
should only be used in defining tables that have no
natural-language-specific dependency. We should not spread the bad
programming technique just because this is what is commonly aimed at in
programming. Used in general-purpose applications, a range of code values
is absolutely locale-dependent, and using it to find matching strings makes
for a very parochial program, not localizable at all, not only between
countries but also between different platforms in the same country. And it
guarantees a mess for end-users, greatly affecting their productivity in
real life. Failing to find something does not mean that it is absent from
the data base being searched, but that is what they will conclude, and
this will raise the adrenaline level in the blood of their customers...
Imagine a bank-robber who has to wait for the cash just because they
cannot retrieve something in a data base with a regular expression
(: Poor thief! (;
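
A small Python sketch of the portability problem, offered as an
illustration only. It assumes the two 8-bit coded character sets are
ISO 8859-1 and ISO 8859-2 (Python codecs "iso8859_1" and "iso8859_2"); the
helper name byte_range_as_unicode and the sample names are made up for the
example.

```python
import re


def byte_range_as_unicode(first, last, charset):
    """Decode every byte value in first..last (inclusive) from charset."""
    return [bytes([b]).decode(charset) for b in range(first, last + 1)]


# The same 8-bit range, as typed on two different platforms.
latin1 = byte_range_as_unicode(0xD8, 0xDF, "iso8859_1")
latin2 = byte_range_as_unicode(0xD8, 0xDF, "iso8859_2")

# Same bytes, different characters: the range is not portable.
differs = latin1 != latin2

# In Latin-2 the two endpoints even swap order once converted to UCS
# values, so a "[first-last]" pattern built from them would be invalid.
endpoints_reversed = ord(latin2[0]) > ord(latin2[-1])

# A code-value range also misses accented letters an end-user expects.
# Made-up sample data, not taken from any real data base.
names = ["Quebec", "Québec", "Élise"]
found = [n for n in names if re.fullmatch("[A-Za-z]+", n)]
```

Here found holds only the unaccented spelling: the other two entries are in
the data, but a search by code-value range will not retrieve them.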

Alain LaBonté
Québec