SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2
chapter
C. Buckley
J. Allan
G. Salton
National Institute of Standards and Technology
D. K. Harman
Automatic Routing and Ad-hoc Retrieval
Using SMART: TREC 2
Chris Buckley*, James Allan, and Gerard Salton
Abstract
The Smart information retrieval project em-
phasizes completely automatic approaches to
the understanding and retrieval of large quan-
tities of text. We continue our work in the
TREC 2 environment, performing both routing
and ad-hoc experiments. The ad-hoc
work extends our investigations into combin-
ing global similarities, giving an overall indica-
tion of how a document matches a query, with
local similarities identifying a smaller part of
the document which matches the query. The
performance of the ad-hoc runs is good, but it
is clear we are not yet taking full advantage of
the available local information.
Our routing experiments use conventional
relevance feedback approaches to routing, but
with a much greater degree of query expan-
sion than was done in TREC 1. The length
of a query vector is increased by a factor of
5 to 10 by adding terms found in previously
seen relevant documents. This approach im-
proves effectiveness by 30-40% over the origi-
nal query.
Introduction
For over 30 years, the Smart project at Cor-
nell University has been interested in the anal-
ysis, search, and retrieval of heterogeneous
text databases, where the vocabulary is al-
lowed to vary widely, and the subject matter
is unrestricted. Such databases may include
newspaper articles, newswire dispatches, text-
books, dictionaries, encyclopedias, manuals,
magazine articles, and so on. The usual text
analysis and text indexing approaches that are
based on the use of thesauruses and other vo-
cabulary control devices are difficult to apply
*Department of Computer Science, Cornell University,
Ithaca, NY 14853-7501. This study was supported
in part by the National Science Foundation under
grant IRI 89-15847.
in unrestricted text environments, because the
word meanings are not stable in such circumstances
and the interpretation varies depending
on context. The applicability of more complex
text analysis systems that are based on
the construction of knowledge bases covering
the detailed structure of particular subject ar-
eas, together with inference rules designed to
derive relationships between the relevant con-
cepts, is even more questionable in such cases.
Complete theories of knowledge representation
do not exist, and it is unclear what concepts,
concept relationships, and inference rules may
be needed to understand particular texts.[11]
Accordingly, a text analysis and retrieval
component must necessarily be based primar-
ily on a study of the available texts themselves.
Fortunately, very large text databases are now
available in machine-readable form, and a sub-
stantial amount of information is automati-
cally derivable about the occurrence properties
of words and expressions in natural-language
texts, and about the contexts in which the
words are used. This information can help in
determining whether a query and a text are se-
mantically homogeneous, that is, whether they
cover similar subject areas. When that is the
case, the text can be retrieved in response to
the query.
Automatic Indexing
In the Smart system, the vector-processing
model of retrieval is used to transform both
the available information requests and the
stored documents into vectors of the form:
$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$
where $D_i$ represents a document (or query)
text and $w_{ik}$ is the weight of term $T_k$ in document
$D_i$. A weight of zero is used for terms
that are absent from a particular document,
and positive weights characterize terms actu-