From ojarnef@admin.kth.se Fri Jun 17 23:04:07 1994
Received: from othello.admin.kth.se by dkuug.dk with SMTP id AA07654
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Fri, 17 Jun 1994 21:04:55 +0200
Received: from mercutio.admin.kth.se by othello.admin.kth.se (5.65+bind 1.8+ida 1.4.2/4.0b)
	id AA29870; Fri, 17 Jun 94 21:04:52 +0200
Received: by mercutio.admin.kth.se (5.65+bind 1.8+ida 1.4.2/4.0)
	id AA03029; Fri, 17 Jun 94 21:04:07 +0200
Date: Fri, 17 Jun 94 21:04:07 +0200
Message-Id: <9406171904.AA03029@mercutio.admin.kth.se>
From: Olle Jarnefors <ojarnef@admin.kth.se>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
        umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk,
        Olle Jarnefors <ojarnef@admin.kth.se>
In-Reply-To: <199406161522.AA00356@dkuug.dk> (16 Jun 94 15:23:00 -0500;
 From: ALB@immedia.ca)
Subject: Re: Full-text searching: don't keep it simple and stupid!
X-Charset: ASCII
X-Char-Esc: 29

Alain LaBont<e'> raises the important question of how
searching should be adapted to the needs of the language of the
text or data being searched. The Nordic report on cultural
requirements [1] says the following about this:

   2.4.4  Searching
   ----------------

   Because the Nordic languages are more complicated than
   English, as far as inflection and formation of compound words
   are concerned, more sophisticated search functions are
   desirable. To be most useful, interactive searching for
   strings or words in a text should be available in three
   modes:

   1. Search for exactly given words

   2. Search for all words consisting of a given string and
      possibly an inflectional ending

   3. Search for the given string, irrespective of its
      surroundings.

   Orthogonal to this, searching should be performed on three
   levels, defined by the treatment of individual letters:

   a) Exact search

   b) Search disregarding the case of letters

   c) Search also disregarding the distinction between letter
      variants which in sorting are treated as the same basic
      letter.

   For effective searching in text, it is also important that
   hyphenated words are dehyphenated, since hyphenation is more
   frequent in the Nordic languages than e.g. English, due to
   the many long compound words.
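
The dehyphenation step mentioned at the end of the quoted section
can be sketched in Python roughly as follows. This is only a minimal
illustration, not something taken from the report: the function name
dehyphenate is hypothetical, and the sketch assumes that every hyphen
directly followed by a line break was inserted by hyphenation. A real
implementation would need a dictionary or language rules to keep
genuine hyphens that happen to fall at the end of a line.

```python
import re

# Assumption: a hyphen between two word characters, directly followed by
# a line break, is always a hyphenation artifact and never a real hyphen.
_LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")

def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub("", text)

# A long Swedish compound broken across two lines.
wrapped = "Rapporten handlar om informations-\nteknologi i Norden."
flat = dehyphenate(wrapped)

# The compound can only be found once the text has been dehyphenated.
found_before = "informationsteknologi" in wrapped
found_after = "informationsteknologi" in flat
```

Searching the dehyphenated copy rather than the stored text keeps the
original line breaks intact for display.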

What is said here about Danish, Faroese, Finnish, Icelandic,
Norwegian, and Swedish is also true for German, Dutch and other
languages, but of course such properties as
-- which basic letters are distinguished
-- which other letters are treated as variants of a basic letter
-- which capital letter or capital string is equivalent to a
   small letter
-- which characters are allowed in a word
-- which inflectional endings are possible
vary between languages.
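
To make the combination of modes 1-3 and levels a-c concrete, here
is a rough Python sketch in which the language-dependent properties
listed above are passed in as parameters. Everything named here
(LanguageProfile, fold, search, and the Swedish sample values) is a
hypothetical illustration and not taken from the report; the list of
endings and letter variants is a small sample, not a complete
description of Swedish. The sketch also assumes that folding maps
one character to one character, so it does not cover the case where
a capital string corresponds to a single small letter.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class LanguageProfile:
    """Hypothetical per-language settings (illustrative values only)."""
    endings: tuple            # inflectional endings accepted in mode 2
    variants: dict            # letter variant -> basic letter (level c)
    word_chars: str = r"\w"   # regex class for characters allowed in a word

def fold(text, level, profile):
    """Level 'a': exact; 'b': ignore case; 'c': also ignore letter variants."""
    if level == "a":
        return text
    text = text.lower()
    if level == "c":
        text = "".join(profile.variants.get(ch, ch) for ch in text)
    return text

def search(text, query, mode, level, profile):
    """Return start offsets of matches in the folded text.

    Mode 1: the exact word; mode 2: the word plus an optional
    inflectional ending; mode 3: the string irrespective of surroundings.
    """
    hay = fold(text, level, profile)
    needle = re.escape(fold(query, level, profile))
    w = profile.word_chars
    if mode == 1:
        pattern = rf"(?<!{w}){needle}(?!{w})"
    elif mode == 2:
        # Try longer endings first so "arna" is preferred over "ar".
        ordered = sorted(profile.endings, key=len, reverse=True)
        alts = "|".join(re.escape(e) for e in ordered)
        ending = f"(?:{alts})?" if alts else ""
        pattern = rf"(?<!{w}){needle}{ending}(?!{w})"
    elif mode == 3:
        pattern = needle
    else:
        raise ValueError("mode must be 1, 2 or 3")
    return [m.start() for m in re.finditer(pattern, hay)]

# Sample profile: a few Swedish noun endings and a few letter variants.
swedish = LanguageProfile(
    endings=("en", "ens", "ar", "arna", "arnas", "s", "et"),
    variants={"w": "v", "é": "e", "ü": "y"},
)

text = "Bilen står vid bilarna, inte i bilverkstaden."
word_hits = search(text, "bil", 1, "b", swedish)
inflected_hits = search(text, "bil", 2, "b", swedish)
substring_hits = search(text, "bil", 3, "b", swedish)
exact_case_hits = search(text, "bil", 2, "a", swedish)
variant_hits = search("Ett kafé.", "kafe", 1, "c", swedish)
no_variant_hits = search("Ett kafé.", "kafe", 1, "b", swedish)
```

In this sample, mode 2 finds "Bilen" and "bilarna" but skips the
compound "bilverkstaden", which only mode 3 reaches. Keeping the
endings, variants and word characters in one profile object is just
one way to isolate what has to change from language to language.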

[1] INSTA:
    Nordic Cultural Requirements on Information Technology.
    Technical Report STRI TS3, 1992. Distributed by:
    Icelandic Council for Standardization, Keldnaholti,
    IS-112 Reykjavik, Iceland. Phone: +354-1-877 000.
    Fax: +354-1-877 409

--
Olle Jarnefors, Royal Institute of Technology, Stockholm <ojarnef@admin.kth.se>
