From ALB@immedia.ca Mon Jun 16 10:23:00 1994
Received: from Clouso.CRIM.CA by dkuug.dk with SMTP id AA00344
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Thu, 16 Jun 1994 17:22:08 +0200
Received: from immedia.ca by clouso.crim.ca (4.1/SMI-4.1)
	id AA26437; Thu, 16 Jun 94 11:21:41 EDT
Return-Path: <ALB@immedia.ca>
Received: by immedia.ca (3.2/2.D)
        id AA27961; 16 Jun 94 15:27:49 -0500
Date: 16 Jun 94 15:23:00 -0500
From: ALB@immedia.ca
Message-Id: <199406161527.AA27961@immedia.ca>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
        umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk
Subject: Full-text searching: don't keep it simple and stupid!
X-Charset: ASCII
X-Char-Esc: 29

----------
This was of more general interest and I put it on some lists which have
interests in internationalisation.
++++++++++++++++++++++++++++++++++
To       : /Personnel/comp-soft-intrn
Subject  : Full-text search: don't keep it simple and stupid
From     : ALB

Glen Seeds sent a private note to Keld Simonsen with a copy to me and asked:

>Keld, my company (which produces a full text search product) is
>attempting to establish character classes for various European
>languages.  For most such languages, our users prefer that we
>ignore case and accents.
>
>However, Danish seems to have some exceptions to this.  An 'O'
>with a slash is treated as a separate letter.  Are there others?
>For example, would users be upset if a search for "angstrom"
>ignored the ring, or conversely, would they be upset if a search
>with the ring did NOT find ones without (and vice-versa)?
>What is normal practice in Denmark?

Keld answered, legitimately and correctly:

>In Denmark, the letters O WITH STROKE, AE and A WITH RING are genuine
>letters and people would be very upset if it is not handled as such.

Now I think for French (and perhaps German and other languages too), the answer
is unfortunately not as simple.

In French there are indeed only 26 letters of the alphabet. However they have
variants, 14 accented characters (multiplied by 2 if you include upper case)
and 2 ligatures (4 with uppercase).  This is without counting borrowed
characters from borrowed words which are shown with their original or modified
spellings in French dictionaries (ex.: "ca<n~>on" [alternate but secondary
spelling "canyon"], "angstr<o:>m" [the original spelling, but not the French,
also has an A RING on the first "a"; alternate or secondary spelling
"angstroem"], "sant<a->l<i->" [where the "-" here stands for the macron
[horizontal bar overlining the letters; no alternate spelling; listed in LEXIS
dictionary, one of the most complete affordable French dictionary;
"sant<a->l<i->" is the name of a language spoken in the Indian states of Orissa
and Eastern Bengale]).

Now sometimes you want to ignore accents in full text searching [to retrieve,
for example "CL<E'>" or "CLEF" (two correct spelling for the same word) with
index "CLE"].

But sometimes also (perhaps as frequently), you want to have a full search on
accented letters ignoring the unaccented.  The most hitting example (which made
a full-text search system fail in Quebec in the 60's or early 70's because it
had been designed with unaccented letters even in text: it was for jurisprudence
of the Quebec Napoleonic code; the computer specialists at that time did not
consult the jurists and thought that what was tolerated [but incorrect] for
nominative data, i.e.  the use of upper case unaccented letters in text, was
acceptable!  The full data base had to be reentered; fortunately an expert
system helped put the accents in place according to context with 98% accuracy,
but it is a highly context-sensitive task!) is the search on "D<U^>" (if you
ignore the circumflex accent [which makes the word mean "DUE" in English, it
becomes the article "DU" [masculin case of "OF THE"], the 22nd most frequent
word of the French language: the result is only noise...)

To solve this problem, with the help of the ordering model I designed (and used
as the basis in POSIX), I redesigned the comparison engine and made sure when I
submitted the NP for the International Ordering standard that the engine was
part of the scope.  This was already a SHARE Europe requirement before the
project of ordering standard came of age and I wrote a paper in French in 1991,
titled "Fonctions de syst<`e>mes -- Soutien des langues nationales" (ISBN
2-550-21778-0), that describes different functions including this one, that
exploit the ordering mechanism which is, in fact, a comparison mechanism.

I copy here what I wrote in the latest version of the International ordering
standard I'm working on:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5.2 Comparison engine

5.2.2 Multi-field key comparison operation

The comparison engine will operate as follows:

1      Get arguments:

       Argument a:

                    option: value 1 or 2:

                     1:    comparison to be done on coded character strings

                     2:    comparison done on predefined multilevel binary keys

       Arguments b:

             if option 1:

                    string 1:    refered coded character string (argument b1)
                    string 2:    reference coded character string (argument b2)

             or if option 2:

                    key 1: refered multilevel binary key (argument b1)
                    key 2: reference multilevel binary key (argument b2)

       Argument c:  ****** sorry here I had to badly reformat for email ******

level of precision: a value between 1 and n; this value determines after which
level the binary comparison should stop if equivalence (approximate equality) is
going to be detected, in case the two arguments b are not absolutely equal.  In
all cases of inequality, the operation is done on all available levels of the
multilevel key to determine if the refered argument b1 comes before or after the
reference argument b2.  If the level of precision specified is greater than the
last level available, each of the missing levels are considered to implicity
bear the smallest value possible for the latter levels for comparison purposes.

Argument d:

                    result of comparison: following values possible:
                    1: absolute equality
                    2: equivalence (at precision level required by argument c)

Argument e *BADLY REFORMATTED FOR EMAIL* (in case of absolute equality, value 3
is returned; in case of equivalence, both values 3 and 4 are possible, as for
totally unequal strings):

                    3: refered b1 comes before reference b2
                    4: refered b1 comes after reference b2


2      Process (*** REFORMATTED BADLY FOR EMAIL ****)

If option 1 is chosen, form binary key 1 and binary key 2 respectively
corresponding to string 1 and string 2 according to clause 5.2.3; then, or if
option 2 has been chosen, do a binary comparison on both keys within the limits
of argument c and return a result with the specifications given above for
argument d.  In case multiple fields are to be processed, and when equivalence
or absolute equality has been detected, then repeat the same process for each
field of the refered and reference records to take place in the comparison,
until a difference is found or until all characters of all fields have been
compared in case of equivalence or absolute equality.
+++++++++++++++++++++++++++++++++++++++++++++++++++++

This is a rework of my original work and of the SHARE Europe requirement (which
also took into consideration a suggestion from Ed Hart, from Johns Hopkins
University, at that time at SHARE Inc. [USA], to have more granularity than only
2 cases of equalities, absolute or equivalent).  It is not definitive
(arguments d and e are still under editing process, as you will have
remarked) but gives you an idea on how the sate of the art should do things in
full-text searching, in my humble opinion.

Alain LaBont<e'>
Gouvernement du Qu<e'>bec
Secr<e'>tariat du Conseil du tr<e'>sor

ALB@IMMEDIA.CA is my real address if the header of this message shows something
else.
