SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-2 Routing and Ad-Hoc Retrieval Evaluation using the INQUERY System
chapter
W. Croft
J. Callan
J. Broglio
National Institute of Standards and Technology
D. K. Harman
of this is that complex Boolean queries can be evaluated as easily as natural language queries
and produce ranked output. It is also possible to represent "rule-based" or "concept-based"
queries in the same probabilistic framework. This has led us to concentrate on automatic
analysis of queries and techniques for enhancing queries rather than on in-depth analysis
of the documents in the database. In general, it is more effective (as well as efficient) to
analyze short query texts than millions of document texts. The results of the query analysis
are represented in the INQUERY query language, which contains a number of operators,
such as #SUM, #AND, #OR, #NOT, #PHRASE, and #SYN. These operators implement
different methods of combining evidence and describing concepts.
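As a rough illustration of how such operators can combine evidence, the sketch below evaluates a small structured query over per-term belief values for a single document. The text above does not define the combining functions. The formulas used here (product for #AND, complement of the product of complements for #OR, complement for #NOT, arithmetic mean for #SUM) are the ones commonly given for inference-network operators and are assumed rather than taken from this paper. The tuple-based query representation, the `evaluate` function, and the default belief of 0.4 for unseen terms are likewise hypothetical. #PHRASE and #SYN are omitted because they depend on positional and term-frequency information.

```python
from math import prod

# Assumed belief assigned to a term that has no evidence in the document.
DEFAULT_BELIEF = 0.4


def evaluate(node, term_beliefs):
    """Return a belief in [0, 1] for a query node.

    A node is either a term (str) or a tuple (operator, [child nodes]).
    term_beliefs maps each term to its belief for one document.
    """
    if isinstance(node, str):
        return term_beliefs.get(node, DEFAULT_BELIEF)

    op, children = node
    beliefs = [evaluate(child, term_beliefs) for child in children]

    if op == "AND":   # all children must hold: product of beliefs
        return prod(beliefs)
    if op == "OR":    # at least one child holds
        return 1.0 - prod(1.0 - b for b in beliefs)
    if op == "NOT":   # negation of a single child
        return 1.0 - beliefs[0]
    if op == "SUM":   # unweighted average of the children
        return sum(beliefs) / len(beliefs)
    raise ValueError(f"unsupported operator: {op}")


# Illustrative query: #SUM( #AND(retrieval evaluation) #OR(routing filtering) )
query = ("SUM", [("AND", ["retrieval", "evaluation"]),
                 ("OR", ["routing", "filtering"])])
beliefs = {"retrieval": 0.5, "evaluation": 0.5,
           "routing": 0.5, "filtering": 0.5}
score = evaluate(query, beliefs)
```

Because every operator returns a belief rather than a true/false value, a Boolean-style query still yields a score that can be used to rank documents.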
Some of the specific research issues we are addressing are morphological analysis in En-
glish and Japanese, word sense disambiguation in English, the use of phrases and other
syntactic structure in English and Japanese, the use of special purpose recognizers (for
example, company, country and people name recognizers) in representing documents and
queries, analyzing natural language queries to build structured representations of informa-
tion needs, learning techniques appropriate for routing and structured queries, techniques
for acquiring domain knowledge by corpus analysis, and probability estimation techniques
for indexing.
The first TREC evaluation and the two previous TIPSTER evaluations have made it
clear that a lot remains to be learned about retrieval in large, full-text databases based
on complex information needs. Issues such as phrases, relevance feedback, and probability
estimation have proven to be quite difficult in such environments. On the other hand, the
effectiveness levels achieved have been quite good. The experiments done in the TREC-2
evaluation, together with the 24-month TIPSTER evaluation which followed it, were
designed to improve our understanding of which IR techniques work and why.
2 System Description
The document retrieval and routing system that has been developed on the basis of the in-
ference net model is called INQUERY [2]. The main processes in INQUERY are document
indexing, query processing, query evaluation and relevance feedback.
In the document indexing process, documents are parsed and index terms representing
the content of documents are identified. INQUERY supports a variety of indexing tech-
niques including simple word-based indexing, indexing based on part-of-speech tagging and
phrase identification, and indexing by domain-dependent features such as company names,
dates, locations, etc. The last type of indexing is a first step towards integrating detection
and extraction systems.
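Simple word-based indexing with a stopword list can be sketched roughly as follows. This is a minimal illustration, not INQUERY's actual parser: the stopword list, the token pattern, and the `index_terms` function are all hypothetical. The sketch keeps a stopword when it is capitalized away from the start of a sentence or when it is joined to another word by a hyphen. Stemming is left out.

```python
import re

# Tiny illustrative stopword list (an assumption, not INQUERY's list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

# Words and numbers, optionally joined by hyphens, or sentence-ending punctuation.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[.!?]")


def index_terms(text):
    """Return lowercased index terms for text, applying the stopword rules."""
    terms = []
    sentence_start = True
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok in ".!?":
            # The next word token begins a new sentence.
            sentence_start = True
            continue
        is_stop = tok.lower() in STOPWORDS
        # A capitalized stopword is kept unless it starts a sentence.
        capitalized_mid_sentence = tok[0].isupper() and not sentence_start
        if not is_stop or capitalized_mid_sentence:
            terms.append(tok.lower())
        sentence_start = False
    return terms


terms = index_terms("The cat sat in The Hague. The end.")
```

A hyphen-joined token is matched as a single unit, so it never equals a bare stopword and is always indexed.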
In more detail, the document structure is used to identify which parts will be used for
indexing. The first step of this process is then to scan for word tokens. Most types of
words (including numbers) are indexed, although a stopword list is used to remove very
common words. Stopwords can be indexed, however, if they are capitalized (but not at
the start of sentences) or joined with other words (e.g. "the The-i system"). Words are
then stemmed to conflate variants. Although the Porter stemmer was used for the TREC-2