Gilles Detillieux (grdetil@scrc.umanitoba.ca)
Thu, 18 Mar 1999 12:26:18 -0600 (CST)
According to Andrew Scherpbier:
> One of the things that seems to be hard to deal with is defining exactly what a
> word is. Everyone is most likely very much aware of my first attempt at
> this: valid_punctuation. :-) Well, I think a much better method would be to
> add multiple word permutations to the database. For example something like
> "D'Amore" (last name of one of my coworkers) could be entered into the
> database as "d'amore" and "amore". The problem there is the word location.
> My first thought was to give them both the same location number, but they
> really aren't the same word, so a phrase search (which would presumably need
> to be done on the *exact* words, not permutations) could possibly give
> incorrect results. Maybe a better example would be something like
> "word-source" which would be entered into the database as "word", "source",
> and "word-source". What are the locations for those words, then?
Just to give a few examples I thought of, I think all these phrases should
be treated as equivalent in a phrase search:
Linux User Group
Linux Users Group
Linux User's Group
Linux Users' Group
Linux User-Group
Also, any of these ought to match the same word:
cooperation
co-operation
coöperation (diaeresis; seldom used in English, but valid nonetheless)
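As a rough illustration of how these three spellings could be made to match, here is a minimal sketch in Python. The function name fold_word is hypothetical and is not part of ht://Dig; it assumes that matching is done on a folded key that is lowercased, stripped of diacritical marks, and stripped of hyphens. Dropping every hyphen is only appropriate for words like these, not for compounds whose parts should be searchable separately.

```python
import unicodedata


def fold_word(word):
    """Return a matching key: lowercase, no combining marks, no hyphens.

    Hypothetical helper for illustration only.
    """
    # NFKD splits "ö" into "o" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", word.lower())
    # Drop the combining marks, keeping only the base characters.
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat the hyphen as if it were not there.
    return no_marks.replace("-", "")


variants = ["cooperation", "co-operation", "coöperation"]
keys = {fold_word(v) for v in variants}  # all three collapse to one key
```

With this folding, a query for any one of the three spellings would look up the same key as the other two.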
Note that in the case of user-group, and a lot of hyphenated compound
words, you want to treat the words separately, but in some cases, the
hyphen should be ignored and the whole compound word treated as a single
word. E.g. electro-physiology = electrophysiology, post-doctoral =
postdoctoral, but: activity-dependent = activity dependent, full-time =
full time.
I think the only way to deal with these consistently would be to enter
the individual words, and their concatenation, separately into the
database. So word-source should be entered as word, source, wordsource,
and possibly word-source, depending on how htsearch will deal with the
hyphen.
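To make the proposal concrete, here is a minimal sketch in Python of generating the database entries for one token. Everything in it is an assumption for illustration: the name index_entries, the keep_hyphenated flag, and especially the location scheme, which gives the individual words consecutive locations and gives the concatenated and hyphenated forms the location of the first word. The question of which locations are right is still open; this is just one possible answer, not how htdig does it.

```python
import re


def index_entries(token, location, keep_hyphenated=True):
    """Return (word, location) pairs to store for one source token.

    Hypothetical sketch: the parts get consecutive locations, and the
    combined forms share the location of the first part.
    """
    token = token.lower()
    parts = [p for p in re.split(r"-+", token) if p]
    if len(parts) < 2:
        # Not a compound: store the bare word, if there is one.
        return [(parts[0], location)] if parts else []
    # The individual words, e.g. "word" and "source".
    entries = [(part, location + offset) for offset, part in enumerate(parts)]
    # Their concatenation, e.g. "wordsource".
    entries.append(("".join(parts), location))
    if keep_hyphenated:
        # Optionally the original form, e.g. "word-source".
        entries.append((token, location))
    return entries


entries = index_entries("word-source", 7)
```

Under these assumptions, "word-source" at location 7 yields word at 7, source at 8, and both wordsource and word-source at 7, so a phrase search for "word source" and a single-word search for "wordsource" would both find it.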
-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930