NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
National Institute of Standards and Technology, D. K. Harman

The ConQuest System

Paul E. Nelson
VP of Research & Development
CONQUEST™ SOFTWARE
9705 Patuxent Woods Drive, Columbia, Maryland 21046
(410)-290-6290

Introduction

ConQuest Software has a commercially available text search and retrieval system* called "ConQuest" (for Concept Quest). ConQuest is primarily an advanced statistically based search system, with processing enhancements drawn from the field of Natural Language Processing (NLP).

ConQuest participated in Category A of TREC, and so produced results for 50 test queries over the entire 2.3-gigabyte database. In this category, we constructed queries and submitted results for two different ranking functions. These two functions tested the difference between local and global document relevancy, and are fully described later.

In TREC-2, ConQuest had a very strong showing. Our recall scores in particular improved by about 18 percentage points over the adjusted TREC-1 scores. Our precision scores were also very competitive.

The purpose of this paper is to discuss how we prepared for TREC-2: how queries were performed, what initial judgments were made and why, and how the results were interpreted. Then, I will cover the tests which were performed after TREC-2, and how these tests clearly identify the areas where ConQuest could most effectively be improved.

System Architecture

For a complete discussion of the system architecture of ConQuest, see the TREC-1 conference proceedings, or call the author. The following overview is meant as a brief refresher.

ConQuest uses pre-built indexes to perform text database searches at high speed. In such a system, all text to be searched must first be indexed. These indexes are then used for all searching; the original document data is not required.
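To make the idea of searching from pre-built indexes concrete, here is a minimal sketch of an inverted index. It is not ConQuest's index format, which this paper does not describe; the function names `build_index` and `search`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query token (index only)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, needed only at indexing time.
docs = {
    1: "text retrieval with pre-built indexes",
    2: "semantic network of word meanings",
    3: "indexes support fast text search",
}
index = build_index(docs)
del docs  # the search below touches only the index, not the documents
hits = search(index, "text indexes")
print(sorted(hits))
```

Once `build_index` has run, the original text can be discarded, as the `del docs` line shows: every lookup in `search` is answered from the index alone.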
ConQuest uses a dictionary augmented with a semantic network for both indexing and queries. The dictionary is a list of words, each of which contains multiple meanings. Each meaning contains syntactic information (part-of-speech, feature values) and a dictionary definition.

* For additional information on ConQuest, please contact the author.

The semantic network contains nodes which correspond to meanings of words. These nodes are linked to other related nodes. Relationships between nodes are extracted from machine-readable dictionaries. Some example relationship types include synonym, antonym, child-of, parent-of, related-to, part-of, substance-of, contrasting, and similar-to.

The ConQuest dictionary was generated automatically from several commercially available Machine Readable Dictionary (MRD) sources. This gives ConQuest the most robust and thorough coverage of English available. It is the completeness of coverage that drives performance gains in recall and precision.

Since ConQuest is a commercially available product, many additional components not required for TREC-2 are also available, such as true client/server operation, graphical user interfaces, routing and dissemination, and sophisticated application program interfaces.

Query

Generally speaking, ConQuest attempts to refine and enhance the user's query. The result is then matched against the indexes to look for documents which contain similar concepts.

Queries are not "understood" in the traditional sense of natural language processing. ConQuest makes no attempt to deeply understand the objects in the query, their interaction, or the user's intent. Rather, ConQuest attempts to understand the meaning of each individual word and the importance of the word. It then uses the set of meanings and their related terms (retrieved from the semantic network) as a statistical set which is matched against document information stored in the indexes.
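The expansion step described above can be sketched as follows. This is an illustration under stated assumptions, not ConQuest's algorithm: the tiny `NETWORK`, the per-relation weights in `RELATION_WEIGHT`, the choice to skip antonyms, and the additive scoring in `score` are all hypothetical. Only the relation type names (synonym, antonym, child-of, part-of) come from the text.

```python
from collections import defaultdict

# Hypothetical semantic network: term -> list of (related term, relation type).
NETWORK = {
    "car": [("automobile", "synonym"), ("vehicle", "child-of"), ("engine", "part-of")],
    "fast": [("quick", "synonym"), ("slow", "antonym")],
}

# Assumed weights: related terms count less than the original query words.
RELATION_WEIGHT = {"synonym": 0.8, "child-of": 0.5, "part-of": 0.3}


def expand(query_terms):
    """Return {term: weight} for the query words plus weighted related terms."""
    weights = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for related, relation in NETWORK.get(term, []):
            w = RELATION_WEIGHT.get(relation)
            if w is None:  # relations without a weight (here: antonym) are skipped
                continue
            weights[related] = max(weights.get(related, 0.0), w)
    return weights


def score(index, weights):
    """Sum term weights per document, using only an inverted index."""
    scores = defaultdict(float)
    for term, w in weights.items():
        for doc_id in index.get(term, ()):
            scores[doc_id] += w
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical inverted index: term -> ids of documents containing it.
INDEX = {"car": {1}, "automobile": {2}, "vehicle": {2, 3}, "quick": {1}, "slow": {3}}

weights = expand(["fast", "car"])
ranked = score(INDEX, weights)
print(ranked)
```

In this toy setup, document 2 is retrieved even though it contains neither query word, because it matches related terms from the network. That is the recall benefit the expansion is meant to provide, while the lower weights keep exact matches ranked first.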