From ALB@immedia.ca Sun Jun 18 14:46:00 1994
Received: from Clouso.CRIM.CA by dkuug.dk with SMTP id AA08415
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Fri, 17 Jun 1994 21:41:22 +0200
Received: from immedia.ca by clouso.crim.ca (4.1/SMI-4.1)
	id AA00935; Fri, 17 Jun 94 15:41:07 EDT
Return-Path: <ALB@immedia.ca>
Received: by immedia.ca (3.2/2.D)
        id AA30009; 17 Jun 94 19:47:15 +1900
Date: 17 Jun 94 19:46:00 +1900
From: ALB@immedia.ca
Message-Id: <199406171947.AA30009@immedia.ca>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
        umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk
Subject: Re: Full-text searching: don't keep it simple and stupid!
X-Charset: ASCII
X-Char-Esc: 29

----------
I could not agree more with the contribution from Olle Jarnefors (of Sweden)
appended below.  In fact, even "inflections" are covered by the SHARE Europe
requirement, under what we call "fuzzy matching".  However, fuzzy matching is
difficult to implement without expert system technology, and it is not possible
with POSIX so far (perhaps we could say it is included in the general model of
international ordering, since that model has a mandatory preprocessing phase,
even though the content of that phase is not specified).

For English speakers who may not be familiar with all the implications of
inflection, let me cite French: if you search for "oeil" (the singular of the
word meaning "eye"), you ideally want to retrieve "yeux" too (the plural of the
same word), and if you search for "beau" (the masculine singular of the word
meaning "pretty"), you also want to retrieve "beaux" (masculine plural),
"belle" (feminine singular) and "belles" (feminine plural).  That's what Olle
is talking about.  In English this would mean also retrieving "women" if you
search for "woman", which is a simpler case but is also an inflection.
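
A minimal sketch of such a preprocessing step, in Python.  The table and the
names INFLECTION_SETS, expand_query and search are hypothetical; the table
holds only the examples cited above, whereas a real system would need a
morphological dictionary or analyser for each language.

```python
import re

# Hypothetical hand-written table holding only the examples from this message.
# A real system would need a morphological dictionary or analyser instead.
INFLECTION_SETS = [
    {"oeil", "yeux"},
    {"beau", "beaux", "belle", "belles"},
    {"woman", "women"},
]


def expand_query(word):
    """Preprocessing step: return every known inflected form of `word`."""
    word = word.lower()
    for forms in INFLECTION_SETS:
        if word in forms:
            return set(forms)
    return {word}  # unknown word: fall back to the word itself


def search(word, text):
    """Return the words of `text` that are inflections of `word`."""
    wanted = expand_query(word)
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters
    return [tok for tok in tokens if tok.lower() in wanted]
```

With this table, search("oeil", "Un oeil, deux yeux.") returns both "oeil"
and "yeux".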

Of course, in the international ordering model, the level of precision up to
which you accept equivalences is functionally specified (inflections are not
handled, except in preprocessing): 1st level: base letters; 2nd level:
diacritics [unless the diacritics are included in the base letters]; 3rd level:
case; 4th [or 5th; see the standard for a possible intermediate level for
Arabic] level: specials [such as hyphens].  This is also what I have called in
the past "significance levels", by analogy to floating-point numbers, which
basically have 3 levels of significance (sign, order of magnitude and mantissa),
or levels of decomposition for computing and comparison purposes.
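
As a rough illustration of these significance levels, the Python sketch below
splits a string into four keys and compares two strings only up to a chosen
level.  It is not the algorithm of the standard: the names level_keys and
equivalent are hypothetical, Unicode canonical decomposition stands in for a
real decomposition table, and every non-alphanumeric character is simply
treated as a "special".

```python
import unicodedata


def level_keys(text):
    """Split `text` into (base letters, diacritics, case, specials)."""
    base, marks, case, specials = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: a diacritic, recorded with the position it follows.
            marks.append((len(base), ch))
        elif ch.isalnum():
            base.append(ch.lower())      # level 1: base letter
            case.append(ch.isupper())    # level 3: case
        else:
            # Level 4: specials such as hyphens (simplifying assumption).
            specials.append((len(base), ch))
    return ("".join(base), tuple(marks), tuple(case), tuple(specials))


def equivalent(a, b, level):
    """True if `a` and `b` agree on significance levels 1..level."""
    return level_keys(a)[:level] == level_keys(b)[:level]
```

For example, "cote" and "c\u00f4t\u00e9" (cote with a circumflex and an acute
accent) are equivalent at level 1 but not at level 2, and "co-op" and "coop"
are equivalent up to level 3 but not at level 4.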

The decomposition of letters is determined by a LOCALE, not by the engine
itself, and the large international default LOCALE is tailorable.
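
A small Python sketch of that separation between engine and LOCALE.  The
class Locale and the names default_decompose, DEFAULT, SWEDISH and base_key
are hypothetical, the default rule is just Unicode canonical decomposition,
and the input is assumed to use precomposed (NFC) characters.  The Swedish
tailoring shown only illustrates the idea that a-ring, a-diaeresis and
o-diaeresis count as base letters of their own in that language.

```python
import unicodedata


def default_decompose(ch):
    """Default rule: (base letter, diacritics), taken from Unicode NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]


class Locale:
    """A default decomposition plus optional per-character tailoring."""

    def __init__(self, tailoring=None):
        self.tailoring = dict(tailoring or {})

    def decompose(self, ch):
        # The tailoring, when present, overrides the default rule.
        return self.tailoring.get(ch, default_decompose(ch))


DEFAULT = Locale()
# Illustrative tailoring: keep a-ring, a-diaeresis, o-diaeresis undecomposed.
SWEDISH = Locale({ch: (ch, "") for ch in "\u00e5\u00e4\u00f6"})


def base_key(word, locale):
    """First-level key: the base letters according to `locale`."""
    return "".join(locale.decompose(ch)[0] for ch in word.lower())
```

The engine (base_key) is the same in both cases; only the table it is handed
changes, so base_key("\u00e5r", DEFAULT) gives "ar" while the SWEDISH locale
keeps the a-ring.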

Alain LaBonté
Gouvernement du Québec
Secrétariat du Conseil du trésor

Original message:
==============================================================================
        To:
            RNET ( BEALLE@TOROLAB6.VNET.IBM.COM, CPWG-MAIL@REVCAN.CA, PAREF@VM1.ULAVAL.CA,),
            RNET ( UMAVS@TOROLAB6.VNET.IBM.COM), ALB
        CC: RNET ( I18N@DKUUG.DK, SC22WG20@DKUUG.DK, TC304@DKUUG.DK,),
            RNET ( OLLE JARNEFORS <OJARNEF@ADMIN.KTH.SE>)
      From: RNET (ojarnef@admin.kth.se)
   Subject: Re: Full-text searching: don't keep it simple and stupid!
      Date: Fri 17 Jun 94
      Time: 19:14 UT
      Type: Mail
  Delivery: Regular
==============================================================================
Alain LaBonté raises the important question of how
searching should be adapted to the needs of the language of the
searched text/data. The following is what the Nordic report on
cultural requirements [1] says about this:

   2.4.4  Searching
   ----------------

   Because the Nordic languages are more complicated than
   English, as far as inflection and formation of compound words
   are concerned, more sophisticated search functions are
   desirable. To be most useful, interactive searching for
   strings or words in a text should be available in three
   modes:

   1. Search for exactly given words

   2. Search for all words consisting of a given string and
      possibly an inflectional ending

   3. Search for the given string, irrespective of its
      surroundings.

   Orthogonal to this, searching should be performed on three
   levels, defined by the treatment of individual letters:

   a) Exact search

   b) Search disregarding the case of letters

   c) Search also disregarding the distinction between letter
      variants which in sorting are treated as the same basic
      letter.

   For effective searching in text, it is also important that
   hyphenated words are dehyphenated, since hyphenation is more
   frequent in the Nordic languages than e.g. English, due to
   the many long compound words.

What is said here about Danish, Faroese, Finnish, Icelandic,
Norwegian, and Swedish is also true for German, Dutch and other
languages, but of course such properties as
-- which basic letters are distinguished
-- which other letters are treated as variants of a basic letter
-- which capital letter or capital string is equivalent to a
   small letter
-- which characters are allowed in a word
-- which inflectional endings are possible
vary between languages.
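
As a rough illustration, the three search modes of the report, together with
dehyphenation as a preprocessing step, could be sketched in Python as follows.
The names dehyphenate, find and ENDINGS are hypothetical, and the list of
endings is only a placeholder (a few Swedish noun endings); the real list is
one of the language-dependent properties mentioned above.

```python
import re

# Placeholder list of inflectional endings, for illustration only.
ENDINGS = ("", "en", "ar", "arna")


def dehyphenate(text):
    """Rejoin words that were split by a hyphen at a line break."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def find(query, text, mode, endings=ENDINGS):
    """Return the words of `text` matched by `query` in search mode 1, 2 or 3."""
    text = dehyphenate(text)
    q = re.escape(query)
    if mode == 1:
        # Mode 1: exactly the given word.
        pattern = rf"\b{q}\b"
    elif mode == 2:
        # Mode 2: the given string plus a possible inflectional ending.
        alternatives = "|".join(re.escape(e) for e in endings)
        pattern = rf"\b{q}(?:{alternatives})\b"
    elif mode == 3:
        # Mode 3: the given string irrespective of its surroundings;
        # the whole containing word is reported.
        pattern = rf"\w*{q}\w*"
    else:
        raise ValueError("mode must be 1, 2 or 3")
    return [m.group(0) for m in re.finditer(pattern, text)]
```

On the text "bil bilar bilarna bilverkstad lastbil", a search for "bil" finds
one word in mode 1, three in mode 2 and all five in mode 3.  Case folding and
letter variants (levels a to c of the report) are left out of this sketch.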

[1] INSTA:
    Nordic Cultural Requirements on Information Technology.
    Technical Report STRI TS3, 1992. Distributed by:
      Icelandic Council for Standardization, Keldnaholti,
      IS-112 Reykjavik, Iceland. Phone: +354-1-877 000.
      Fax: +354-1-877 409

--
Olle Jarnefors, Royal Institute of Technology, Stockholm <ojarnef@admin.kth.se>