NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Site Report for the Text REtrieval Conference
P. Nelson
National Institute of Standards and Technology
Donna K. Harman
The modules used for indexing are as follows:
* Parse Document: Looks for codes in the text database to locate fields such as title,
headline, authors, etc. These fields can be indexed, ignored, or stored in a special
database for fast access. Parse Document takes a command file describing the
structure of the text database to be indexed, and can handle a wide variety of text
file formats.
* Tokenize: Divides a string of characters into words. This may include special
processing for dates, phone numbers, floating point numbers, hyphens, etc.
* Morphology: An advanced form of stemming that attempts to remove suffixes and perform
spelling changes to reduce words to simpler forms which are found in the dictionary.
For example, one morphology rule will take "babies," strip the "ies," add "y," and
produce "baby," which is found in the dictionary. Irregular forms of words are stored
directly in the dictionary and are not subject to morphological analysis.
Morphology is a much more accurate form of word reduction than stemming, because
the dictionary can be used to validate the transformations. Morphology will not reduce
proper nouns, and will produce much more accurate reductions, especially for words
ending in "e."
* Find Idioms: Idioms are phrases which have a meaning beyond that of the individual
words added together. For example, "kangaroo court" has nothing whatsoever to do
with kangaroos. Also proper nouns, such as "United States," have a meaning beyond
the sum of their component parts.
This module finds idioms in the text and indexes the idiom as a single unit. This
prevents idioms such as "Dow Jones Industrial Average" from getting confused with
queries on "industrial history." Words inside of idioms can still be located individually,
if desired.
* Index: The final step is to store the reduced words and collected idioms into the indexes.
The index is an inverted positional word index, which is conceptually similar to the
index at the back of a textbook.
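The pipeline above (tokenize, morphological reduction, idiom collection, positional indexing) can be sketched as follows. All names, rules, and data here are illustrative assumptions for a minimal sketch; ConQuest's actual implementation is not described at this level of detail in the report.

```python
# Minimal sketch of the indexing pipeline: tokenize, reduce words via
# dictionary-validated morphology, collect idioms, and store everything
# in an inverted positional index. Dictionary, idiom list, and rule set
# are hypothetical examples, not ConQuest's real data.

import re
from collections import defaultdict

DICTIONARY = {"baby", "kangaroo", "court", "saw"}
IDIOMS = {("kangaroo", "court")}

def tokenize(text):
    """Divide a string of characters into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def morphology(word):
    """Reduce a word to a simpler form, validated against the dictionary."""
    if word in DICTIONARY:
        return word
    # Example rule: strip "ies", add "y" ("babies" -> "baby").
    if word.endswith("ies") and word[:-3] + "y" in DICTIONARY:
        return word[:-3] + "y"
    return word  # no validated reduction found

def index_document(doc_id, text, index):
    """Store reduced words and collected idioms in an inverted
    positional index mapping term -> list of (doc_id, position)."""
    tokens = [morphology(t) for t in tokenize(text)]
    for i, token in enumerate(tokens):
        for idiom in IDIOMS:
            if tuple(tokens[i:i + len(idiom)]) == idiom:
                # Index the idiom as a single unit; its component
                # words below remain individually searchable.
                index[" ".join(idiom)].append((doc_id, i))
        index[token].append((doc_id, i))
    return index

index = defaultdict(list)
index_document(1, "The babies saw a kangaroo court", index)
```

Because each posting records a position as well as a document, the index supports both idiom lookup as a unit and lookup of the words inside an idiom, as the report describes.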
The speed of indexing by ConQuest has been measured at approximately 40 megabytes
per hour. This evaluation was done on a Sun SPARCstation IPC (a 14-MIPS computer), with
32 megabytes of RAM. The same speed has been achieved on a 486 IBM PC, running at a
clock rate of 33 MHz, with 16 megabytes of RAM.
Query
The query process is more complex than indexing, due to word expansion and ranking.
Generally speaking, ConQuest attempts to refine and enhance the user's query. The result
is then matched against the indexes to look for documents which contain similar patterns.
Queries are not "understood" in the traditional sense of natural language processing.
ConQuest makes no attempt to deeply understand the objects in the query, their interaction,
or the user's intent. Rather, ConQuest attempts to understand the meaning of each
individual word and the importance of the word. It then uses the set of meanings and their
related terms (retrieved from the semantic networks) as a statistical set which is matched
against document information stored in the indexes.
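This word-by-word expansion and statistical matching might be sketched as below. The semantic network, weights, and function names are hypothetical illustrations of the approach described, not ConQuest's actual data structures.

```python
# Minimal sketch of query expansion and ranking: each query word is
# expanded with related terms drawn from a semantic network, and
# documents are scored by how strongly they overlap the expanded set.
# All terms and weights are illustrative assumptions.

SEMANTIC_NET = {
    "car": {"automobile": 0.9, "vehicle": 0.7},
    "fast": {"quick": 0.8, "rapid": 0.8},
}

def expand(query_words):
    """Build a weighted set: each query word at full weight, plus its
    related terms at their semantic-network weights."""
    weights = {}
    for w in query_words:
        weights[w] = max(weights.get(w, 0.0), 1.0)
        for related, wt in SEMANTIC_NET.get(w, {}).items():
            weights[related] = max(weights.get(related, 0.0), wt)
    return weights

def rank(query_words, docs):
    """Score each document by the summed weights of the expansion
    terms it contains, and return documents best-first."""
    weights = expand(query_words)
    scores = {
        doc_id: sum(weights.get(w, 0.0) for w in words)
        for doc_id, words in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": ["the", "automobile", "was", "quick"],
    "d2": ["a", "slow", "bicycle"],
}
ranked = rank(["car", "fast"], docs)
```

Note that no parse of the query's syntax or intent is needed: a document mentioning "automobile" and "quick" matches a query for "car fast" purely through the weighted term sets, which is the statistical matching the text describes.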