NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Site Report for the Text REtrieval Conference
P. Nelson
National Institute of Standards and Technology
Donna K. Harman

Introduction

ConQuest Software has a commercially available text search and retrieval system* called "ConQuest" (for Concept Quest). ConQuest is primarily an advanced statistically based search system, with processing enhancements drawn from the field of Natural Language Processing (NLP).

ConQuest participated in Category A of TREC, and so produced results for 50 test queries over the entire 2.3-gigabyte database. In this category, we constructed queries and submitted results for two methods: Method 1, where queries were automatically generated from the TREC topics, and Method 2, where queries were manually constructed by the software engineers at ConQuest.

We were extremely pleased with our performance in TREC, as the Category A system with the highest 11-point averages. This performance is not indicative of our full potential, however, for the system is still relatively young. We are continuing to evaluate, test, and tune our dictionaries, ranking algorithms, and search methods.

This paper describes our background, the system architecture used for TREC, some features of ConQuest, and a discussion of the results. It is written for those with a background in computers and some exposure to artificial intelligence and text retrieval.

Background

ConQuest has been working on text retrieval since 1988. From the beginning, we meant to use Natural Language Processing (NLP) to better understand the text database and improve retrieval accuracy. This was a natural approach, since both founders of ConQuest, Edwin Addison and Paul Nelson, teach NLP at Johns Hopkins University.
During development, we concentrated on solving the two biggest problems of Natural Language Processing: 1) most NLP requires large, hand-crafted knowledge bases, and 2) traditional techniques are not robust in the face of text errors.

To solve the first problem, we relied on machine-readable dictionaries and thesauri for all knowledge data. Machine-readable dictionaries provide ample information on syntax, word variations, and inflected forms. Thesauri and similar sources provide semantic information (word and meaning relationships), which we compiled into structured networks. Both sources were judged to be useful for text retrieval. Combining the resources, however, required significant engineering effort.

The second problem, that of robust processing, required work in two areas. First, NLP development was directed towards statistical approaches and away from rule-based approaches. Statistical approaches typically use heuristics or probabilities to provide confidence factors that accumulate evidence. Unlike rules, which are typically pass/fail, statistical approaches can handle unexpected or variable input without causing total system failure. Second, to fully solve the problem of robust processing, we had to have good, solid software engineering. Most NLP systems are ad-hoc affairs, often thrown together at the last minute and patched up. ConQuest has concentrated on producing concrete, bullet-proof software.

* For additional information on ConQuest, please contact the author.
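The contrast above can be made concrete with a minimal sketch. This is not ConQuest's actual implementation; the weights, the matching heuristics, and the toy semantic network below are all illustrative assumptions. The point is only the shape of the technique: each partial match contributes a confidence factor, evidence accumulates into a score, and a failed match weakens a document's ranking instead of rejecting it outright as a pass/fail rule would.

```python
# Illustrative sketch of evidence accumulation with confidence factors.
# All weights, heuristics, and data here are hypothetical, not ConQuest's.

# A toy semantic network of thesaurus-style word relationships.
SEMANTIC_NET = {
    "ship":   {"vessel", "boat"},
    "boat":   {"vessel", "ship"},
    "vessel": {"ship", "boat"},
}

def evidence(query_term, doc_terms):
    """Yield confidence factors instead of a single pass/fail answer."""
    for term in doc_terms:
        if term == query_term:
            yield 1.0          # exact match: strongest evidence
        elif term.startswith(query_term[:4]):
            yield 0.6          # shared prefix (crude stem): weaker evidence
        elif query_term in SEMANTIC_NET.get(term, ()):
            yield 0.4          # related meaning via the semantic network

def score(query_terms, doc_terms):
    """Accumulate all evidence; a missed term lowers the score, never zeroes it."""
    return sum(c for q in query_terms for c in evidence(q, doc_terms))

docs = {
    "d1": ["the", "ship", "sailed"],
    "d2": ["ships", "and", "boats"],
    "d3": ["weather", "report"],
}
query = ["ship", "vessel"]
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
# d1 ranks first (exact "ship" plus a semantic link to "vessel"); d2 still
# earns partial credit from the shared stem in "ships"; d3 scores nothing.
```

A Boolean (pass/fail) matcher would have discarded d2 entirely, since neither query term appears verbatim; accumulated evidence instead gives it a reduced but nonzero rank, which is the robustness property the text describes.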