From ALB@immedia.ca Fri Jun 20 10:09:00 1994
Received: from Clouso.CRIM.CA by dkuug.dk with SMTP id AA07199
  (5.65c8/IDA-1.4.4j for <i18n@dkuug.dk>); Mon, 20 Jun 1994 17:04:18 +0200
Received: from immedia.ca by clouso.crim.ca (4.1/SMI-4.1)
	id AA27277; Mon, 20 Jun 94 11:04:03 EDT
Return-Path: <ALB@immedia.ca>
Received: by immedia.ca (3.2/2.D)
        id AA14137; 20 Jun 94 15:10:10 -0500
Date: 20 Jun 94 15:09:00 -0500
From: ALB@immedia.ca
Message-Id: <199406201510.AA14137@immedia.ca>
To: bealle@torolab6.vnet.ibm.com, cpwg-mail@revcan.ca, paref@vm1.ulaval.ca,
        umavs@torolab6.vnet.ibm.com
Cc: i18n@dkuug.dk, sc22wg20@dkuug.dk, tc304@dkuug.dk
Subject: Full-text searching: killed word lists not a panacea
X-Charset: ASCII
X-Char-Esc: 29

----------
To       : rnet(comp-software-international@news-digests.mit.edu)
Subject  : Full-text searching: kill lists
From     : ALB
Date     : 6/20/94
Time     : 14:41

Ted Dunning says about my posting:

>  The most hitting example (...) is the search on "D<U^>" (if you
>  ignore the circumflex accent [which make the word mean "DUE" in
>  English, it becomes the article "DU" [masculin case of "OF THE"],
>  the 22nd most frequent word of the French language: the result is
>  only noise...)
>
>most information retrieval systems would include du as a killed word,
>which would allow d<u^> to be indexed without or without accents.
>
>it is clearly important to perform kill list processing at the right
>point, but retrieving without regard to accents can be considered a
>form of normalization similar to suffix stripping.  it will probably
>have about the same effects, some positive, some negative.

It differs from suffix stripping in French in many cases, as quite often
accents distinguish etymologically different words.  If you would not object
to retrieving, in English, all the occurrences of "cannon" when you search
for "canyon", or "sin" when you look for "fishing", or "Macon" (wine) when you
look for "mason", or "hill" and "side" when you look for "quotation", or
"discomfort" when you look for "gene", then I might follow you.  Of course none
of the equivalent French words "canon", "ca<n~>on", "p<e'>ch<e'>", "p<e^>che",
"Macon", "ma<c,>on", "c<o^>te", "c<o^>t<e'>", "cote", "g<e^>ne", "g<e`>ne" can
be processed as killed words automatically, and retrieving them when you do
not want them only adds garbage: with this primitive technique, some
languages will give you more trouble than others.
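
A minimal Python sketch of the collisions listed above, assuming Unicode
input and the standard unicodedata module.  The names fold_accents, WORDS
and collisions are illustrative only; the glosses are the English
counterparts given in this message.

```python
import unicodedata
from collections import defaultdict

def fold_accents(word):
    """Strip diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# French words from the message (ASCII escapes), with their English glosses.
WORDS = {
    "canon": "cannon",
    "ca\u00f1on": "canyon",        # ca<n~>on
    "p\u00e9ch\u00e9": "sin",      # p<e'>ch<e'>
    "p\u00eache": "fishing",       # p<e^>che
    "Macon": "Macon (wine)",
    "ma\u00e7on": "mason",         # ma<c,>on
    "c\u00f4te": "hill",           # c<o^>te
    "c\u00f4t\u00e9": "side",      # c<o^>t<e'>
    "cote": "quotation",
    "g\u00eane": "discomfort",     # g<e^>ne
    "g\u00e8ne": "gene",           # g<e`>ne
}

def collisions(words):
    """Group words whose accent-folded, lowercased forms coincide."""
    groups = defaultdict(list)
    for w in words:
        groups[fold_accents(w).lower()].append(w)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

if __name__ == "__main__":
    for key, forms in sorted(collisions(WORDS).items()):
        print(key, "<-", ", ".join(f"{w} ({WORDS[w]})" for w in forms))
```

Every word in the list falls into one of five collision groups once accents
are ignored, and none of those groups can sensibly be put on a kill list.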

The "killed" word technique is of course essential, precisely to remove
garbage, but it is certainly not a panacea that can replace the multilevel
searching approach.  Diacritics are more meaningful in some languages than in
others.  French is in the first category, and it is not exceptional to have
meaningful accents in this language (even if I agree that in many cases, too,
they are there only for pronunciation, in which case they are not meaningful;
but you don't want to take this last specific case as the rule, do you?  The
Acad<e'>mie did not even succeed in making accents disappear in the few tens
of words where they were useless [there were almost riots in the streets of
Paris against the "righting" of spelling adopted by the French parliament a
few years ago -- I barely exaggerate]; but apart from this, in most cases
accents are there in French for both meaning and pronunciation at the same
time).

You should also be able to switch off the "killed word" technique in some
cases, and to fine-tune unaccented searches on specific accented words.
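
As a rough illustration of these two controls, here is a small Python
sketch, not a description of any existing retrieval system.  It assumes a
toy in-memory index; KILL_LIST, build_index, search and the sample
documents are all hypothetical.  The kill list is applied to exact forms,
so "du" is dropped while "d<u^>" stays indexed, and each query chooses
whether accents matter.

```python
import unicodedata

def fold(word):
    """Lowercase and strip diacritics (NFD, then drop combining marks)."""
    nfd = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

# Hypothetical kill list: exact (accented) forms only.
KILL_LIST = {"du", "de", "la", "le"}

def build_index(docs, use_kill_list=True):
    """Map folded form -> {exact form -> set of document ids}."""
    index = {}
    for doc_id, text in docs.items():
        for token in unicodedata.normalize("NFC", text).lower().split():
            if use_kill_list and token in KILL_LIST:
                continue  # killed before folding, so d<u^> survives
            index.setdefault(fold(token), {}).setdefault(token, set()).add(doc_id)
    return index

def search(index, query, accent_sensitive=True):
    """Second level: exact accents. First level only: any accent variant."""
    forms = index.get(fold(query), {})
    if accent_sensitive:
        exact = unicodedata.normalize("NFC", query).lower()
        return set(forms.get(exact, set()))
    hits = set()
    for ids in forms.values():
        hits |= ids
    return hits

# Hypothetical sample documents.
DOCS = {
    1: "le montant d\u00fb",
    2: "le prix du pain",
    3: "la c\u00f4te du lac",
    4: "le c\u00f4t\u00e9 gauche",
}

if __name__ == "__main__":
    killed = build_index(DOCS)                        # kill list on
    full = build_index(DOCS, use_kill_list=False)     # kill list off
    print(search(killed, "d\u00fb"))                  # exact accented search
    print(search(full, "du", accent_sensitive=False)) # noisy unaccented search
    print(search(killed, "cote", accent_sensitive=False))
```

Keeping the exact forms under each folded key is what lets one index serve
both kinds of query; a single folded index would lose that choice.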

Alain LaBont<e'>