NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Combination of Multiple Searches
Edward A. Fox and Joseph A. Shaw
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106
Abstract
The TREC-2 project at Virginia Tech focused on methods for combining the evidence from multiple retrieval runs to improve retrieval performance over any single retrieval method. This paper describes one such method that has been shown to increase performance by combining the similarity values from five different retrieval runs using both vector space and P-norm extended boolean retrieval methods.
1 Overview
The primary focus of our experiments at Virginia Tech involved methods of combining the results from various divergent search schemes and document collections. We performed both routing and ad-hoc retrieval experiments on the provided test collections. The results from both vector and P-norm type queries were considered in determining the probability of relevance for each document in an individual collection. The results for each collection were then merged to create a single final set of documents that would be presented to the user.
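To illustrate the general idea (a minimal sketch, not the exact procedure used in our experiments), the following Python fragment sums the similarity values a document receives across several retrieval runs and ranks documents by the combined score; the run data, document IDs, and the function name combine_runs are hypothetical.

    # Sketch: combine per-document similarity values from several retrieval runs.
    # Each run is assumed to be a dict mapping document IDs to similarity scores.
    def combine_runs(runs):
        """Sum the similarity values each document receives across all runs."""
        combined = {}
        for run in runs:
            for doc_id, score in run.items():
                combined[doc_id] = combined.get(doc_id, 0.0) + score
        # Rank documents by combined score, highest first.
        return sorted(combined.items(), key=lambda item: item[1], reverse=True)

    # Two hypothetical runs (e.g., a vector run and a P-norm run) over one collection.
    vector_run = {"WSJ870101-0001": 0.82, "AP880212-0002": 0.40}
    pnorm_run  = {"WSJ870101-0001": 0.75, "ZF07-123-000": 0.55}
    print(combine_runs([vector_run, pnorm_run]))
    # The document retrieved by both runs ranks first with the highest combined score.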
2 Index Creation
This section outlines the indexing done with the document collection provided by NIST. Each of the individual collections was indexed separately as document vector files; limitations in disk space prohibited the use of inverted files and the creation of a single combined document vector file.
All processing was performed on a DECstation
5000/25 with 40 MB of RAM using the 1985 release
of the SMART Information Retrieval System [2], with
enhancements from previous experiments as well as a
new modification for our TREC-2 experiments.
The index files were created from the source text via the following process. First, the source document text provided by NIST was passed through a preparser to convert the SGML-like format to the proper format for the 1985 version of SMART. The extraneous sections of the documents were filtered out at this point. The TEXT sections of the documents, as well as the various HEADLINE, TITLE, SUMMARY, and ABSTRACT sections of the collections, were indexed; all of the other sections were ignored. The subsections of the TEXT fields, where they existed, were considered as part of the TEXT field, with the subsection delimiters removed.

Table 1: SMART weighting schemes used for TREC-2.

    SMART label    term_weight =
    ann            0.5 + 0.5 * tf / max_tf
    bnn            1
    mnn            tf / max_tf
    atn            (0.5 + 0.5 * tf / max_tf) * log(num_docs / coll_freq)
    nnn            tf
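The following Python sketch illustrates the section-filtering step described above under simplified assumptions; the regular-expression approach, the tag list KEEP_TAGS, and the sample document are ours for illustration, and the actual preparser instead emitted the input format expected by the 1985 version of SMART.

    # Sketch: keep only the indexed sections of an SGML-like TREC document.
    import re

    KEEP_TAGS = ("TEXT", "HEADLINE", "TITLE", "SUMMARY", "ABSTRACT")

    def extract_sections(raw_doc):
        """Return the concatenated content of the sections that were indexed."""
        kept = []
        for tag in KEEP_TAGS:
            # Collect every <TAG> ... </TAG> region; all other sections are ignored.
            for body in re.findall(r"<{0}>(.*?)</{0}>".format(tag), raw_doc, re.S):
                # Treat subsections as part of the enclosing field: strip nested delimiters.
                kept.append(re.sub(r"</?\w+>", " ", body).strip())
        return "\n".join(kept)

    sample = ("<DOC><DOCNO>WSJ870101-0001</DOCNO><TITLE>Example story</TITLE>"
              "<TEXT>Body of the story.</TEXT><BYLINE>ignored</BYLINE></DOC>")
    print(extract_sections(sample))   # prints the TEXT body, then the TITLE text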
The resulting filtered text was tokenized, stop words were deleted using the standard 418-word stop list provided with SMART, and the remaining non-noise words were included in the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file was created during indexing which contains, for each document, its unique ID and a vector of term IDs and term weights. The initially recorded weights can be changed based on one of several schemes after the indexing is complete. The various SMART weighting schemes referred to within this paper are summarized in Table 1. The dictionary size for each collection was approximately 16 MB, while the document vector files ranged from 31 MB to 124 MB (see Table 2).
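For concreteness, the sketch below applies the Table 1 formulas to the raw term frequencies of a small, hypothetical document vector; the function term_weight and its arguments are our own names, with num_docs standing for the collection size and coll_freq for the number of documents containing the term.

    # Sketch: the Table 1 weighting schemes applied to raw term frequencies.
    import math

    def term_weight(scheme, tf, max_tf, num_docs=1, coll_freq=1):
        """Compute a term weight under one of the Table 1 schemes."""
        if scheme == "ann":
            return 0.5 + 0.5 * tf / max_tf
        if scheme == "bnn":
            return 1.0
        if scheme == "mnn":
            return tf / max_tf
        if scheme == "atn":
            return (0.5 + 0.5 * tf / max_tf) * math.log(num_docs / coll_freq)
        if scheme == "nnn":
            return float(tf)
        raise ValueError("unknown weighting scheme: " + scheme)

    # Re-weight a tiny document vector (term ID -> within-document frequency).
    raw_tf = {101: 3, 207: 1}
    max_tf = max(raw_tf.values())
    ann_vector = {t: term_weight("ann", tf, max_tf) for t, tf in raw_tf.items()}
    print(ann_vector)   # {101: 1.0, 207: 0.666...}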