SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
The ConQuest System
chapter
P. Nelson
National Institute of Standards and Technology
D. K. Harman
[Diagram not recoverable from the scan; legible labels: Query, Dictionary, Enhancement, Documents, Meanings, Networks]
Figure 1 The Query Process
The following modules are used to process a query:
* Tokenize: Divides a string of characters into words.
* Morphology: An advanced form of stemming;
attempts to remove suffixes and perform spelling
changes to reduce words to simpler forms which are
found in the dictionary. For example, one morphology
rule will take "babies," strip the "ies," add "y," and
produce "baby," which is found in the dictionary.
* Find Idioms: This module finds idioms in the text
and indexes the idiom as a single unit. This prevents
idioms such as "Dow Jones Industrial Average" from
getting confused with queries on "industrial history."
Words inside of idioms can still be located
individually, if desired.
* Query Enhancement: The user is given the
opportunity to enhance the query for additional
improvement in precision and recall. There are many
options available here, but the two most important are
to choose meanings and weight query terms.
Choosing a meaning of a word will restrict the
expansion of words to only related terms which are
relevant to the chosen meanings. This reduces noise in
the query. When running in automatic mode,
ConQuest expands all meanings of all words.
Weighting query terms identifies the importance of the
various words in the query. These weights are used by
the search engine when ranking documents and
computing document relevance factors.
* Remove Stop Words: Small function words (such as
determiners, conjunctions, auxiliary verbs, and small
adverbs) are removed from the query.
* Expand Meanings: Words in the query are expanded to
include related terms.
* Search and Rank: ConQuest uses an integrated search
and rank algorithm (described in the next section)
which considers the relevance rankings of documents
throughout the search process. Since ranking and
search are integrated, the search engine automatically
produces the most relevant documents right away.
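The first few modules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary, idiom list, stop-word list, and suffix rules are hypothetical toy data, and the function names are not ConQuest's. Query enhancement, meaning expansion, and search are left out, and the sketch does not model the ability to locate words inside idioms individually.

```python
import re

# Hypothetical toy resources; the actual ConQuest dictionary is not shown here.
DICTIONARY = {"baby", "industrial", "history", "average", "dow", "jones"}
IDIOMS = [("dow", "jones", "industrial", "average")]
STOP_WORDS = {"the", "of", "a", "an", "and", "or", "is", "are", "for", "in"}
# (suffix to strip, text to add) pairs, tried in order; illustrative only.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", "")]


def tokenize(text):
    """Divide a string of characters into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def reduce_word(word):
    """Apply the first suffix rule that yields a dictionary word."""
    if word in DICTIONARY:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    return word  # no rule produced a dictionary word; keep the original


def find_idioms(tokens):
    """Collapse each known idiom into a single unit, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        for idiom in IDIOMS:
            if tuple(tokens[i : i + len(idiom)]) == idiom:
                out.append(" ".join(idiom))
                i += len(idiom)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def remove_stop_words(tokens):
    """Drop small function words."""
    return [t for t in tokens if t not in STOP_WORDS]


def process_query(text):
    """Tokenize, reduce words, group idioms, then remove stop words."""
    tokens = [reduce_word(t) for t in tokenize(text)]
    return remove_stop_words(find_idioms(tokens))
```

With this toy data, "babies" is reduced to "baby" by the first rule, and "Dow Jones Industrial Average" is kept as one unit, so a separate query on "industrial history" would not match it by accident.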
Queries can be expanded to a very large number of terms, if
desired. If the user wishes for the greatest amount of recall,
a 5 word query can be expanded to 200 or 300 related terms.
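As a rough illustration of how meaning selection limits expansion, the sketch below expands one word through a small, hypothetical semantic network. The network contents, the strength values, and the `min_strength` cutoff are assumptions made for this example; the text does not say how ConQuest controls the size of an expansion.

```python
# Hypothetical semantic network:
# word -> meaning label -> {related term: relation strength in (0, 1]}.
SEMANTIC_NETWORK = {
    "bank": {
        "financial institution": {"lender": 0.9, "credit union": 0.8, "finance": 0.5},
        "river edge": {"shore": 0.9, "embankment": 0.8, "slope": 0.4},
    },
}


def expand_word(word, chosen_meanings=None, min_strength=0.0):
    """Return {term: strength} for `word` and its related terms.

    chosen_meanings=None mimics automatic mode: all meanings are expanded.
    A higher min_strength keeps fewer, more closely related terms.
    """
    expanded = {word: 1.0}
    for meaning, related in SEMANTIC_NETWORK.get(word, {}).items():
        if chosen_meanings is not None and meaning not in chosen_meanings:
            continue  # the user excluded this meaning
        for term, strength in related.items():
            if strength >= min_strength:
                expanded[term] = max(strength, expanded.get(term, 0.0))
    return expanded
```

Calling `expand_word("bank")` returns terms from both meanings, while `expand_word("bank", {"financial institution"}, 0.6)` keeps only the closely related financial terms, which is the noise reduction described above.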
Many other query features are also available in ConQuest,
including wildcards, fuzzy spelling expansion, numeric and
date range searching, Boolean queries, mixed Boolean and
statistical queries, fielded searching (a variety of types),
and searching over document categories.
Ranking Factors
Ranking and retrieval with ConQuest use a variety of
statistics and criteria, which are flexible and can be modified
to handle varying requirements. The following are some of
the factors used in ranking:
* Completeness: A good document should contain at
least one term or related term for each word in the
original query.
* Contextual Evidence: Words are supported by their
related terms. If a document contains a word and its
related terms, then the word is given a higher weight
because it is surrounded by supporting evidence.
* Semantic Distance: The semantic network contains
information on how closely two terms are related.
* Proximity: A document is considered to be more
relevant if it contains matching terms which occur
close together, preferably in the same paragraph or
sentence.
* Quantity: The absolute quantity of hits in the
document is also included, but is not as strong a
discriminator of relevance as the other factors.
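To make three of these factors concrete, here is a minimal sketch of completeness, proximity, and quantity for a tokenized document. It is not ConQuest's ranking function: the formulas, the function names, and the weights in `rank_score` are assumptions chosen for illustration, with quantity given the smallest weight because the text calls it the weakest discriminator. Semantic distance and contextual evidence are omitted.

```python
def completeness(doc_tokens, expanded_query):
    """Fraction of original query words matched by the word or a related term.

    expanded_query maps each original query word to the set of terms it
    expands to, including the word itself (hypothetical structure).
    """
    if not expanded_query:
        return 0.0
    present = set(doc_tokens)
    matched = sum(1 for terms in expanded_query.values() if present & terms)
    return matched / len(expanded_query)


def quantity(doc_tokens, expanded_query):
    """Absolute number of hits in the document."""
    all_terms = set().union(*expanded_query.values()) if expanded_query else set()
    return sum(1 for t in doc_tokens if t in all_terms)


def smallest_window(doc_tokens, expanded_query):
    """Length of the shortest token span with a hit for every query word, or None."""
    hits = [(i, w) for i, t in enumerate(doc_tokens)
            for w, terms in expanded_query.items() if t in terms]
    need = len(expanded_query)
    counts, covered, best, left = {}, 0, None, 0
    for right in range(len(hits)):
        w = hits[right][1]
        counts[w] = counts.get(w, 0) + 1
        if counts[w] == 1:
            covered += 1
        while covered == need:  # shrink the window from the left
            span = hits[right][0] - hits[left][0] + 1
            if best is None or span < best:
                best = span
            lw = hits[left][1]
            counts[lw] -= 1
            if counts[lw] == 0:
                covered -= 1
            left += 1
    return best


def rank_score(doc_tokens, expanded_query, weights=(0.5, 0.4, 0.1)):
    """Weighted sum of the three factors; the weights are illustrative only."""
    w_comp, w_prox, w_qty = weights
    window = smallest_window(doc_tokens, expanded_query)
    prox = 0.0 if window is None else min(1.0, len(expanded_query) / window)
    qty = quantity(doc_tokens, expanded_query)
    return (w_comp * completeness(doc_tokens, expanded_query)
            + w_prox * prox
            + w_qty * qty / (1 + qty))  # saturating, so raw counts cannot dominate
```

Under these assumptions, a document that mentions "share price" in adjacent positions outranks one that repeats "price" several times but never mentions any term for "stock".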
ConQuest is the first truly "concept-based" search system to
operate over unrestricted domains. If a document contains
a query word together with some of its related terms, the
word is more likely to be used in the intended sense; the
"contextual evidence" factor above rewards this. In this way,
ConQuest can determine word meanings at query time.
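The idea of using contextual evidence to settle on a meaning at query time can be sketched as follows. The scoring rule (the fraction of a meaning's related terms that appear in the document) and the sample data are assumptions made for this example, not the formula ConQuest uses.

```python
def contextual_evidence(doc_tokens, word, meanings):
    """Score each meaning of `word` by how many of its related terms co-occur.

    meanings maps a meaning label to a set of related terms (hypothetical data).
    Returns {} when the word itself does not occur in the document.
    """
    present = set(doc_tokens)
    if word not in present:
        return {}
    return {
        label: (len(present & related) / len(related)) if related else 0.0
        for label, related in meanings.items()
    }


def supported_meaning(doc_tokens, word, meanings):
    """Return the meaning with the most supporting evidence, or None if there is none."""
    scores = contextual_evidence(doc_tokens, word, meanings)
    if not scores or max(scores.values()) == 0.0:
        return None
    return max(scores, key=scores.get)
```

For example, with meanings of "bank" described by {"loan", "deposit", "interest"} and {"shore", "water", "fishing"}, a document containing "bank", "interest", and "loan" supports the first meaning, while a document that contains "bank" with none of these terms supports neither.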
Coarse and Fine Grain Ranking
To further improve retrieval speed, ConQuest performs the
search in two phases. The first is "coarse-grain." This phase
is integrated with the document search process. Documents
are output from the ConQuest search engine in descending
coarse-grain rank order.
To compute the coarse-grain rank for a document, the
statistics for the words contained in the document are
combined using the coarse-grain ranking function. The
inputs to this function include the semantic network