NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

The ConQuest System
P. Nelson
National Institute of Standards and Technology
D. K. Harman

[Figure 1: The Query Process. Labels legible in the source: Query, Dictionary, Enhancement, Documents, Meanings, Networks.]

The following is a description of the modules used for query:

* Tokenize: Divides a string of characters into words.
* Morphology: An advanced form of stemming; attempts to remove suffixes and perform spelling changes to reduce words to simpler forms which are found in the dictionary. For example, one morphology rule will take "babies," strip the "ies," add "y," and produce "baby," which is found in the dictionary.
* Find Idioms: This module finds idioms in the text and indexes each idiom as a single unit. This prevents idioms such as "Dow Jones Industrial Average" from being confused with queries on "industrial history." Words inside idioms can still be located individually, if desired.
* Query Enhancement: The user is given the opportunity to enhance the query for additional improvement in precision and recall. There are many options available here, but the two most important are choosing meanings and weighting query terms. Choosing a meaning for a word restricts its expansion to only those related terms which are relevant to the chosen meaning. This reduces noise in the query. When running in automatic mode, ConQuest expands all meanings of all words. Weighting query terms identifies the importance of the various words in the query. These weights are used by the search engine when ranking documents and computing document relevance factors.
* Remove Stop Words: Small function words (such as determiners, conjunctions, auxiliary verbs, and small adverbs) are removed from the query.
* Expand Meanings: Words in the query are expanded to include related terms.
* Search and Rank: ConQuest uses an integrated search and rank algorithm (described in the next section) which considers the relevance rankings of documents throughout the search process. Because ranking and search are integrated, the search engine automatically returns the most relevant documents first.

Queries can be expanded to a very large number of terms, if desired. For the greatest amount of recall, a five-word query can be expanded to 200 or 300 related terms.

Many other query features are also available in ConQuest, including wildcards, fuzzy spelling expansion, numeric and date range searching, Boolean, mixed Boolean and statistical, fielded searching (a variety of types), and searching over document categories.

Ranking Factors

Ranking and retrieval with ConQuest use a variety of statistics and criteria, which are flexible and can be modified to handle varying requirements. The following are some of the factors used in ranking:

* Completeness: A good document should contain at least one term or related term for each word in the original query.
* Contextual Evidence: Words are supported by their related terms. If a document contains a word and its related terms, then the word is given a higher weight because it is surrounded by supporting evidence.
* Semantic Distance: The semantic network contains information on how closely two terms are related.
* Proximity: A document is considered to be more relevant if it contains matching terms which occur close together, preferably in the same paragraph or sentence.
* Quantity: The absolute quantity of hits in the document is also included, but is not as strong a discriminator of relevance as the other factors.

ConQuest is the first truly "concept-based" search system to operate over unrestricted domains. Under the "contextual evidence" factor above, if a document contains a word together with some of its related terms, the word is more likely to be used in the correct context.
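As a rough illustration of how some of the ranking factors above might combine, the following Python sketch scores a document on completeness, contextual evidence, and quantity. It is only a sketch under stated assumptions: the `RELATED` table, the `score` function, and the 0.5/0.4/0.1 weights are hypothetical and are not taken from ConQuest, whose actual ranking function is not given in this text. Semantic distance and proximity are left out to keep the example short.

```python
from collections import Counter

# Hypothetical toy semantic network: query word -> related terms.
# Illustrative data only; not ConQuest's dictionary or networks.
RELATED = {
    "bank": {"loan", "deposit", "teller"},
    "rate": {"interest", "percent"},
}


def score(query_words, doc_tokens, related=RELATED):
    """Toy relevance score from completeness, contextual evidence, quantity."""
    if not query_words:
        return 0.0
    counts = Counter(doc_tokens)
    covered = 0      # query words matched directly or through a related term
    evidence = 0.0   # contextual-evidence bonus accumulated over query words
    hits = 0         # raw number of matching tokens in the document
    for word in query_words:
        terms = related.get(word, set())
        word_hits = counts[word]
        related_present = sum(1 for t in terms if counts[t] > 0)
        if word_hits or related_present:
            covered += 1
        # Contextual evidence: boost a word that occurs with its related terms.
        if word_hits and related_present:
            evidence += related_present / len(terms)
        hits += word_hits + sum(counts[t] for t in terms)
    completeness = covered / len(query_words)
    evidence /= len(query_words)
    quantity = min(hits, 10) / 10  # capped, so quantity stays a weak signal
    # Assumed weights; quantity is deliberately the weakest factor.
    return 0.5 * completeness + 0.4 * evidence + 0.1 * quantity


query = ["bank", "rate"]
doc_a = "the bank raised the loan rate after the deposit review".split()
doc_b = "we walked along the river bank".split()
print(score(query, doc_a), score(query, doc_b))
```

In this toy setup, `doc_a` scores higher than `doc_b`: it covers both query words, and "bank" appears alongside "loan" and "deposit", which supports the financial sense of the word. In `doc_b`, "bank" appears without any supporting terms and "rate" is not covered at all.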
Through this contextual evidence, ConQuest can determine word meanings at query time.

Coarse and Fine Grain Ranking

To further improve retrieval speed, ConQuest performs the search in two phases. The first phase is "coarse-grain" ranking, which is integrated with the document search process. Documents are output from the ConQuest search engine in descending coarse-grain rank order. To compute the coarse-grain rank for a document, the statistics for the words contained in the document are combined using the coarse-grain ranking function. The inputs to this function include the semantic network