term vector as follows:

    S(D_i, Q_j) = \sum_{k=1}^{t} d_{ik} \cdot q_{jk}        (2)
Thus, the similarity between two texts (whether query or document) depends on the weights of coinciding terms in the two vectors.
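As a concrete illustration of expression (2), the following C sketch computes the inner-product similarity of two sparse term vectors by merging their sorted concept lists. The TermWt layout and the function name are assumptions for the example, not the actual SMART data structures.

    #include <stddef.h>

    /* A term-vector entry: a concept (term) number and its weight.
     * Entries are assumed sorted by ascending concept number.
     * (Illustrative layout only, not SMART's internal structures.) */
    typedef struct {
        long  con;     /* concept (term) id */
        float weight;  /* importance of the concept in this text */
    } TermWt;

    /* Inner-product similarity of equation (2): sum the products of
     * weights for concepts common to both vectors.  A single merge
     * pass over the two sorted lists visits each entry once. */
    float
    inner_sim(const TermWt *doc, size_t d_len,
              const TermWt *query, size_t q_len)
    {
        float  sim = 0.0f;
        size_t i = 0, j = 0;

        while (i < d_len && j < q_len) {
            if (doc[i].con < query[j].con)
                i++;
            else if (doc[i].con > query[j].con)
                j++;
            else {
                sim += doc[i].weight * query[j].weight;
                i++, j++;
            }
        }
        return sim;
    }

Only coinciding concepts contribute, so a zero similarity means the two texts share no indexed terms.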
Information retrieval and text linking systems based on the use of global text similarity measures such as that of expression (2) will be successful when the common terms in the two vectors are in fact used in semantically similar ways. In many cases, however, it may happen that highly weighted terms that contribute substantially to the text similarity are semantically distinct. For example, a "sound" may be an audible phenomenon or a body of water.
TREC 1 ([1]) demonstrated that local contexts could be used to disambiguate word senses, for example rejecting documents about "industrial salts" when given a query about the "SALT peace treaty". Overall, however, the improvement in effectiveness due to local matching was minimal in TREC 1. One reason for this is the richness of the TREC queries: global text matching on such long queries is almost invariably sufficient for disambiguation. Another reason is the homogeneity of the queries, which deal primarily with two subjects: finance, and science and technology. Within a single subject area, vocabulary is more standardized and ambiguity is therefore minimized.
One other potential reason for the unexpectedly slight improvement is that most of the information from local matches is simply thrown away. Local matches are used only as a filter to reject documents that do not satisfy a local criterion; the overall global similarity used for ranking is changed only by the addition of a constant indicating that the criterion was satisfied. The positive information that a long document might contain a single paragraph which very closely matches the query is ignored.
For TREC 2, we look at combining global and local similarities into a single final similarity to be used for ranking purposes.
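A minimal sketch of one such combination follows, assuming hypothetical names (combined_sim, ALPHA, BETA): the best paragraph-level similarity is folded into the ranking score rather than serving only as a pass/fail filter. How the two components should actually be weighted is an empirical question; the crnlL2 run described below fits such coefficients from training data.

    #include <stddef.h>

    /* Hypothetical combination weights; in practice these would be
     * fit from training data rather than fixed by hand. */
    #define ALPHA 1.0f
    #define BETA  0.5f

    /* Combine the global document-query similarity with local
     * evidence (the single best-matching paragraph) into one final
     * similarity used for ranking. */
    float
    combined_sim(float global_sim, const float *para_sims, size_t n_paras)
    {
        float  best = 0.0f;
        size_t i;

        for (i = 0; i < n_paras; i++)
            if (para_sims[i] > best)
                best = para_sims[i];

        return ALPHA * global_sim + BETA * best;
    }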
The other focus of our TREC 2 work is taking advantage of the vast quantity of relevance judgements available for the routing experiments. In TREC 1, the relevance information was fragmentary and even occasionally incorrect, which made it hard to use in a reasonable fashion. Happily, the results of the TREC 1 experiments furnished a large number of very good relevance judgements to be used for TREC 2. Conventional vector-space feedback methods of query expansion and reweighting are tuned for the TREC environment in the routing portion of TREC 2.
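The conventional feedback method in the vector-space model is Rocchio-style query reweighting; a minimal sketch follows. The dense weight-array layout, the function name, and the coefficients a, b, c are illustrative assumptions (SMART's vectors are sparse, and the coefficient values actually tuned for TREC are not shown here).

    #include <stddef.h>

    /* Rocchio-style feedback: move the query vector toward the
     * centroid of judged relevant documents and away from the
     * centroid of judged nonrelevant ones:
     *     q' = a*q + b*avg(relevant) - c*avg(nonrelevant)
     * Terms whose weight goes negative are dropped, expanding and
     * reweighting the query in one pass. */
    void
    rocchio(float *query, size_t n_terms,
            float **rel, size_t n_rel,
            float **nonrel, size_t n_nonrel,
            float a, float b, float c)
    {
        size_t t, d;

        for (t = 0; t < n_terms; t++) {
            float rel_avg = 0.0f, non_avg = 0.0f;

            for (d = 0; d < n_rel; d++)
                rel_avg += rel[d][t];
            if (n_rel > 0)
                rel_avg /= (float)n_rel;

            for (d = 0; d < n_nonrel; d++)
                non_avg += nonrel[d][t];
            if (n_nonrel > 0)
                non_avg /= (float)n_nonrel;

            query[t] = a * query[t] + b * rel_avg - c * non_avg;
            if (query[t] < 0.0f)
                query[t] = 0.0f;
        }
    }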
System Description
The Cornell TREC experiments use the SMART Information Retrieval System, Version 11, and are run on a dedicated Sun Sparc 2 with 64 Mbytes of memory and 5 Gbytes of local disk.
SMART Version 11 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. Version 11 is a reasonably complete rewrite of earlier versions, and was designed and implemented primarily by C. Buckley. The new version is approximately 44,000 lines of C code and documentation.
SMART Version 11 offers a basic framework for investigations into the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, the weight indicating the importance of a concept to that particular document (as described above). The document representatives are stored on disk as an inverted file. Natural-language queries undergo the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity (equation (2)), and the documents are then fully ranked by similarity.
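To avoid comparing the query with every document vector directly, such a computation is typically organized around the inverted file: each query term's posting list is walked once, accumulating partial inner products per document. The Posting layout and function signature below are assumptions for illustration, not SMART's actual structures.

    #include <stddef.h>

    /* One posting in the inverted file: a document number and the
     * weight of the term in that document (illustrative layout). */
    typedef struct {
        long  did;     /* document id, assumed < number of documents */
        float weight;
    } Posting;

    /* Accumulate equation (2) for all documents at once.  inv_lists[t]
     * is the posting list for the t-th query term; sims[] must hold
     * one slot per document and be zeroed by the caller.  Ranking is
     * then a sort of the documents by descending sims[]. */
    void
    score_query(const float *q_wts, size_t n_q,
                const Posting *const *inv_lists, const size_t *list_lens,
                float *sims)
    {
        size_t t, p;

        for (t = 0; t < n_q; t++) {
            const Posting *list = inv_lists[t];

            for (p = 0; p < list_lens[t]; p++)
                sims[list[p].did] += q_wts[t] * list[p].weight;
        }
    }

Only documents containing at least one query term ever receive a nonzero score, which is what makes the inverted organization efficient.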
Ad-hoc Results
Cornell submitted two runs in the ad-hoc category. The first, crnlV2, is a very simple vector comparison. The second, crnlL2, makes use of simplified least squares analysis and a training set to combine global similarity and part-wise similarities in a meaningful ratio. Both systems performed at or above the median on almost all queries, as can be seen in Table 1. The crnlV2-b run is the same as the official crnlV2 run, but with an error in the experimental procedure corrected (discussed below).
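As a rough illustration of such a least-squares combination (the exact formulation used for crnlL2 may differ), the sketch below fits coefficients a and b so that a*global + b*local approximates binary relevance over a training set of judged query-document pairs, solving the 2x2 normal equations directly.

    #include <stddef.h>

    /* Fit a, b minimizing the sum over the training set of
     *     (a*global[i] + b*local[i] - rel[i])^2,
     * where rel[i] is 1 for judged-relevant pairs and 0 otherwise.
     * Hypothetical helper: one plausible reading of the "simplified
     * least squares analysis" used for crnlL2. */
    int
    fit_combination(const float *global, const float *local,
                    const float *rel, size_t n,
                    float *a, float *b)
    {
        double sgg = 0, sgl = 0, sll = 0, sgy = 0, sly = 0, det;
        size_t i;

        for (i = 0; i < n; i++) {   /* accumulate the normal equations */
            sgg += (double)global[i] * global[i];
            sgl += (double)global[i] * local[i];
            sll += (double)local[i]  * local[i];
            sgy += (double)global[i] * rel[i];
            sly += (double)local[i]  * rel[i];
        }
        det = sgg * sll - sgl * sgl;
        if (det == 0.0)
            return -1;              /* degenerate training data */
        *a = (float)((sgy * sll - sgl * sly) / det);  /* Cramer's rule */
        *b = (float)((sgg * sly - sgl * sgy) / det);
        return 0;
    }

The fitted coefficients then give the "meaningful ratio" in which global and part-wise similarities are combined at retrieval time.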