From ALB@immedia.ca Tue Jun 17 10:19:00 1994
Received: from clouso.crim.ca ([192.26.210.1]) by dkuug.dk with SMTP id AA02662
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Fri, 17 Jun 1994 17:14:30 +0200
Received: from immedia.ca by clouso.crim.ca (4.1/SMI-4.1)
	id AA11439; Fri, 17 Jun 94 11:13:47 EDT
Return-Path: <ALB@immedia.ca>
Received: by immedia.ca (3.2/2.D)
        id AA25913; 17 Jun 94 15:19:55 -0500
Date: 17 Jun 94 15:19:00 -0500
From: ALB@immedia.ca
Message-Id: <199406171519.AA25913@immedia.ca>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
        umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk
Subject: Re: (TC304.190) Full-text searching: don't keep it simple and stupi
X-Charset: ASCII
X-Char-Esc: 29

----------
The Danish and other cultural requirements are fully covered by the model I am
developing.  The algorithm is multi-level and works on a LOCALE (with
further syntax to handle multi-script properties and some extra features),
which, for ISO/IEC 10646 level 1 conformance, is POSIX-syntax conformant (the
extra script statements can be processed as comments in this case).

The model also has tailorability features which allow the definition of,
say, <AE> to be changed for Danish or for French/English, as required.
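
A minimal sketch of how such tailoring could look in a search setting, in
Python.  This is an illustration only, not the algorithm of the draft: the
names TAILORINGS, EXPANSIONS and search_key are hypothetical, and the language
tags "da" and "fr" stand in for full locales.  Accents and case are folded
away unless the language declares a character to be a letter in its own right.

```python
import unicodedata

# Hypothetical tailoring table: characters that a language treats as
# genuine letters and that must therefore survive folding.
TAILORINGS = {
    "da": {"\u00e5", "\u00f8", "\u00e6"},  # a-ring, o-stroke, ae (Danish)
    "fr": set(),                           # French/English: nothing protected
}

# Letters without a canonical decomposition, expanded by hand when they
# are not protected by the tailoring (ae, o-stroke, oe ligature).
EXPANSIONS = {"\u00e6": "ae", "\u00f8": "o", "\u0153": "oe"}


def search_key(text, lang):
    """Fold case and accents, except for letters the language protects."""
    protected = TAILORINGS.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text).lower():
        if ch in protected:
            out.append(ch)
        elif ch in EXPANSIONS:
            out.append(EXPANSIONS[ch])
        else:
            # Decompose and drop combining marks (accents).
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)
```

Under these assumptions a Danish search for "angstrom" no longer matches the
spelling with the ring, while a French or English search still does.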

So in the message I sent, this was of course implicit, but I did not copy the
full text of the draft.

Alain

Original message:
==============================================================================
        To:
            RNET (ALB@IMMEDIA.CA, BEALLE@TOROLAB6.VNET.IBM.COM, CPWG-MAIL@REVCAN.CA,),
            RNET (PAREF@VM1.ULAVAL.CA, UMAVS@TOROLAB6.VNET.IBM.COM), ALB
        CC: RNET (I18N@DKUUG.DK, SC22WG20@DKUUG.DK, TC304@DKUUG.DK)
      From: RNET (keld@dkuug.dk)
   Subject: Re: (TC304.190) Full-text searching: don't keep it simple and stupi
      Date: Fri 17 Jun 94
      Time: 05:10 UT
      Type: Mail
  Delivery: Regular
==============================================================================
ALB@immedia.ca writes:

> Subject  : Full-text search: don't keep it simple and stupid
>
> >Keld, my company (which produces a full text search product) is
> >attempting to establish character classes for various European
> >languages.  For most such languages, our users prefer that we
> >ignore case and accents.
> >
> >However, Danish seems to have some exceptions to this.  An 'O'
> >with a slash is treated as a separate letter.  Are there others?
> >For example, would users be upset if a search for "angstrom"
> >ignored the ring, or conversely, would they be upset if a search
> >with the ring did NOT find ones without (and vice-versa)?
> >What is normal practice in Denmark?
>
> Keld answered, legitimately and correctly:
>
> >In Denmark, the letters O WITH STROKE, AE and A WITH RING are genuine
> >letters and people would be very upset if they were not handled as such.
>
> Now I think for French (and perhaps German and other languages too), the answer
> is unfortunately not as simple.

I agree with Alain that a number of parameters should be available,
so that different searches (for example with regard to precision) are
possible.

The point of my comment above was that a cultural requirement is also
needed as a parameter, and that is not listed in Alain's model.
Or maybe you could say it is implicitly included, as the comparison
is done with a sorting algorithm, which may be culturally dependent,
as per the different national POSIX locales available.
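
The idea of searching at different precisions on top of a multi-level
comparison can be sketched as follows, in Python.  The three levels used here
(base letters, then accents, then case) and the names level_keys and matches
are assumptions made for illustration; this is neither the POSIX collation
mechanism nor Alain's model, and it leaves out the cultural tailoring that a
real locale would supply.

```python
import unicodedata


def level_keys(text):
    """Return comparison keys for three levels of increasing precision."""
    nfd = unicodedata.normalize("NFD", text)
    # Level 1: base letters only (accents and case ignored).
    level1 = "".join(c for c in nfd if not unicodedata.combining(c)).lower()
    # Level 2: accents kept, case ignored.
    level2 = nfd.lower()
    # Level 3: accents and case both kept.
    level3 = nfd
    return (level1, level2, level3)


def matches(a, b, precision):
    """Compare a and b on the first `precision` levels (1, 2 or 3)."""
    return level_keys(a)[:precision] == level_keys(b)[:precision]
```

With precision 1 a search for "cote" would find the French word written with
its accents and a capital; precision 2 would require the accents, and
precision 3 the capital as well.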

Keld
