NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Site Report for the Text REtrieval Conference
P. Nelson
National Institute of Standards and Technology
Donna K. Harman
Introduction
ConQuest Software has a commercially available text search and retrieval system* called
"ConQuest" (for Concept Quest). ConQuest is primarily an advanced, statistically based
search system, with processing enhancements drawn from the field of Natural Language
Processing (NLP).
ConQuest participated in Category A of TREC, and so produced results for 50 test queries
over the entire 2.3 gigabyte database. In this category, we constructed queries and
submitted results for two methods: Method 1, where queries were automatically generated
from the TREC topics, and Method 2, where queries were manually constructed by the
software engineers at ConQuest.
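As a rough illustration of the kind of processing Method 1 implies, the sketch below generates query terms automatically from a topic's text. The function name, tokenization, and stopword list are invented for this example and do not describe ConQuest's actual pipeline.

```python
# Hypothetical sketch of automatic query generation from a TREC topic.
# The stopword list here is illustrative only.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "for",
             "on", "is", "are", "that", "with", "will"}

def auto_query(topic_text):
    """Lowercase, tokenize on whitespace, strip punctuation, and
    drop stopwords to form a bag of query terms."""
    tokens = [w.strip(".,;:()\"'") for w in topic_text.lower().split()]
    return [t for t in tokens if t and t not in STOPWORDS]

print(auto_query("Documents will report on advances in text retrieval."))
# → ['documents', 'report', 'advances', 'text', 'retrieval']
```

A real system would go further (stemming, phrase detection, weighting), but the core step of reducing free topic text to content-bearing terms is the same.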
We were extremely pleased with our performance in TREC, as the Category A system with
the highest 11-point average. This performance is not indicative of our full potential,
however, for the system is still relatively young. We are continuing to evaluate, test, and
tune our dictionaries, ranking algorithms, and search methods.
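The 11-point average referred to above is a standard TREC evaluation measure: interpolated precision is taken at the eleven recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged. A minimal sketch following the common TREC definition (the function name is ours, not from any ConQuest code):

```python
def eleven_point_average(recalls, precisions):
    """Interpolated 11-point average precision: at each recall level
    r in {0.0, 0.1, ..., 1.0}, take the maximum precision observed at
    any recall >= r, then average the eleven interpolated values."""
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for r in levels:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(levels)

# Four relevant documents retrieved at ranks with these recall/precision points:
print(eleven_point_average([0.25, 0.5, 0.75, 1.0], [1.0, 0.8, 0.6, 0.5]))
```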
This paper describes our background, the system architecture used for TREC, some
features of ConQuest, and a discussion of the results. This paper is written for those with a
background in computers with some exposure to artificial intelligence and text retrieval.
Background
ConQuest has been working on text retrieval since 1988. From the beginning, we intended
to use Natural Language Processing (NLP) to better understand the text database and improve
retrieval accuracy. This was a natural approach, since both founders of ConQuest, Edwin
Addison and Paul Nelson, teach NLP at Johns Hopkins University.
During development, we concentrated on solving the two biggest problems of Natural
Language Processing: 1) Most NLP requires large, hand-crafted knowledge bases, and
2) traditional techniques are not robust in the face of text errors.
To solve the first problem, we relied on machine readable dictionaries and thesauri for all
knowledge data. Machine readable dictionaries provide ample information on syntax, word
variations, and inflected forms. Thesauri and similar sources provide semantic information
(word and meaning relationships) which were compiled into structured networks. Both
sources were judged to be useful for text retrieval. Combining the resources, however,
required significant engineering effort.
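A toy sketch of what compiling thesaurus data "into structured networks" might look like: relationship triples stored as an adjacency map, queried for related words. The relation names and entries are invented for illustration and are not ConQuest's knowledge base.

```python
from collections import defaultdict

def build_network(thesaurus_entries):
    """Compile (word, relation, word) triples into an adjacency map.
    Relations are treated as symmetric here for simplicity."""
    net = defaultdict(list)
    for w1, rel, w2 in thesaurus_entries:
        net[w1].append((rel, w2))
        net[w2].append((rel, w1))
    return net

def related(net, word):
    """Return all words one relation away from `word`, sorted."""
    return sorted({w for _, w in net[word]})

entries = [("boat", "synonym", "ship"), ("ship", "kind-of", "vessel")]
net = build_network(entries)
print(related(net, "ship"))  # → ['boat', 'vessel']
```

A search system can walk such a network at query time to expand a query term to its semantically related neighbors.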
The second problem, that of robust processing, required work in two areas. First, NLP
development was directed towards statistical approaches and away from rule-based
approaches. Statistical approaches typically use heuristics or probabilities to provide
confidence factors that accumulate evidence. Unlike rules, which are typically pass/fail,
statistical approaches can handle unexpected or variable input without causing total system
failure.
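The contrast with pass/fail rules can be sketched as a score that accumulates weighted evidence: a missing or garbled feature lowers the score gradually instead of rejecting the match outright. This is a minimal illustration of the idea, not ConQuest's actual ranking algorithm.

```python
def confidence(evidence):
    """Accumulate weighted evidence into a confidence factor.

    evidence: list of (weight, observed) pairs. Each observed feature
    contributes its weight, so noisy or unexpected input degrades the
    score gradually rather than causing an all-or-nothing failure."""
    total = sum(w for w, _ in evidence)
    score = sum(w for w, seen in evidence if seen)
    return score / total if total else 0.0

# A misspelled term loses one piece of evidence, but the match survives:
print(confidence([(2, True), (1, True), (1, False)]))  # → 0.75
```

A rule-based matcher given the same input would simply fail the unmet condition; here the document still ranks, just lower.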
But to fully solve the problem of robust processing, we had to have good, solid software
engineering. Most NLP systems are ad-hoc affairs, often thrown together at the last minute
and patched up. ConQuest has concentrated on producing concrete bullet-proof software,
* For additional information on ConQuest, please contact the author.