From keld  Fri Mar 13 19:26:32 1998
Received: (from keld@localhost) by dkuug.dk (8.6.12/8.6.12) id TAA08390; Fri, 13 Mar 1998 19:26:32 +0100
Message-Id: <199803131826.TAA08390@dkuug.dk>
From: keld@dkuug.dk (Keld J|rn Simonsen)
Date: Fri, 13 Mar 1998 19:26:30 +0100
In-Reply-To: =?iso-8859-1?Q?Kolbj=F8rn?==?iso-8859-1?Q?_?==?iso-8859-1?Q?Aamb=F8?=@unicode.org
       "Re: Regular expressions in Unicode (Was: Ethiopic text)" (Mar 13, 14:24)
X-Charset: ISO-8859-1
X-Char-Esc: 29
Mime-Version: 1.0
Content-Type: Text/Plain; Charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Mnemonic-Intro: 29
X-Mailer: Mail User's Shell (7.2.2 4/12/91)
To: unicode@unicode.org
Subject: Re: Regular expressions in Unicode (Was: Ethiopic text)
Cc: i18n

Kolbjørn Aambø writes:

> Peter Westlake <peter@harlequin.co.uk> wrote:
> :
> >Now, if I want to find a word beginning with A in a list of
> >scientific words used in English, then I would hope to find
> >"Ångstrøm". But if I were searching for names beginning with
> >A in the Danish telephone directory, it would be a mistake to
> >find "Ångstrøm". So I need to say what I mean. If I want to
> >match A-F in English, I need a short way of saying whether to
> >include accents and case and of saying that I mean English.
> >Something like [A-F::u,a,uk] where u means upper case, a means
> >any accent, uk is from a standard list of codes. The range is
> >interpreted in the context of the UK collating sequence. To
> >omit Ångstrøms, I would ask for ^[A::u,a,dk]*  meaning "a string
> >beginning with a letter that matches A in Danish". In this context,
> >"Danish" and "English" can be seen as equivalence relations that
> >partition the character set into equivalence classes. Kolbjørn
> >gave an example of such a relation.

You should normally treat a search pattern according to the
locale of the user, not the originator. That way the user gets
things matched and sorted according to his/her own expectations;
the rules that the producer used should not matter. It is
quite difficult to know all the rules of the data producer.
Take the Danish telephone directory: would you know the rules there?
I would bet that most people in the world do not know
about the Danish æøå sorting and matching rules, and even fewer
know the rules for aa ü ð þ ö ä and other letters.

So rule number one: always sort and match according to 
the expectations of the user.

For sophisticated users, you could then offer a way to say: I expect
results according to this specific foreign collating sequence.
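
To make this concrete, here is a rough sketch in Python of matching
"a word beginning with A" under the user's locale. It is only an
illustration: the DISTINCT_LETTERS table, the function names and the
locale codes "en" and "da" are made up for this example, and the table
is a drastic simplification of real collation rules (it ignores the
Danish "aa", for instance).

```python
import unicodedata

# Hypothetical, heavily simplified tailoring: the letters that each locale
# treats as letters in their own right rather than as accented variants.
DISTINCT_LETTERS = {
    "en": set(),
    "da": {"æ", "ø", "å"},
}

def base_letter(ch, user_locale):
    """Return the letter under which ch is filed in user_locale."""
    ch = ch.lower()
    if ch in DISTINCT_LETTERS[user_locale]:
        return ch
    # Otherwise drop combining marks, so "å" is filed under "a".
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch

def starts_with(word, letter, user_locale):
    """True if word begins with letter according to the given locale."""
    return base_letter(word[0], user_locale) == base_letter(letter, user_locale)

# The locale passed in is the user's own by default; a sophisticated
# user could pass a foreign one explicitly.
print(starts_with("Ångstrøm", "A", "en"))
print(starts_with("Ångstrøm", "A", "da"))
```

With the user's locale as the default argument, an English user finds
"Ångstrøm" under A and a Danish user does not, and neither has to know
the rules of whoever produced the data.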

Keld