NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
E. Fox
M. Koushik
J. Shaw
R. Modlin
D. Rao
National Institute of Standards and Technology
Donna K. Harman
Combining Evidence from Multiple Searches
Edward A. Fox, M. Prabhakar Koushik,
Joseph Shaw, Russell Modlin and Durgesh Rao
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106
Abstract
At Virginia Tech and PRC Inc., investigations with TREC data have focused on developing
and comparing mechanisms for combining evidence related to a number of search schemes. Our
work with the first CD-ROM has included various indexing, weighting, retrieval, combination,
evaluation, and failure analysis efforts. Related work reported elsewhere in the proceedings by
Paul Thompson discusses extensions undertaken by PRC Inc. and an evaluation of those results.
In future work we will develop our ideas further and test them on additional data. We hope the
results can be evaluated in connection with other work on TREC and TIPSTER.
1 Overview
The 1992 TREC effort at Virginia Tech was carried out largely on a DECstation 5000 Model
25 with 40 MB of RAM. The 1985 version of the SMART retrieval system, with numerous
enhancements of our own, was used for indexing, retrieval, and evaluation.
Our efforts were divided into two main phases. Prior to the TREC meeting we worked solely
with the Wall Street Journal (WSJ) on the first CD-ROM. Thus, in Phase 1, we made eight different
types of runs employing three different retrieval models - the Boolean model, the p-norm model and
the vector space model - and three different weighting schemes. The queries for the Boolean and
p-norm runs were manually generated by project team members. Vector queries were generated
automatically from the topic descriptions. The results from these runs were then merged to
provide combined result sets. We also made relevance judgements on a subset of the
retrieved documents to support training studies.
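
As a rough illustration of what merging several runs into one combined result set can look like, the following Python sketch sums min-max normalized scores across runs. It is a generic illustration under stated assumptions, not the merging procedure used in these experiments: the normalization, the optional weights, the function names, and the document identifiers and scores are all hypothetical.

```python
from collections import defaultdict


def normalize(run):
    """Min-max scale one run's scores to [0, 1]; a constant-score run maps to 1.0."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        # E.g. an unranked Boolean result set: every retrieved document counts equally.
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def merge_runs(runs, weights=None):
    """Merge runs by a weighted sum of normalized scores (an assumed scheme).

    Returns a list of (doc_id, combined_score), best first; ties break on doc_id.
    """
    if weights is None:
        weights = [1.0] * len(runs)
    combined = defaultdict(float)
    for run, weight in zip(runs, weights):
        for doc, score in normalize(run).items():
            combined[doc] += weight * score
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document IDs and scores for one topic.
boolean_run = {"WSJ-A": 1.0, "WSJ-B": 1.0}
pnorm_run = {"WSJ-A": 0.9, "WSJ-C": 0.4, "WSJ-B": 0.1}
vector_run = {"WSJ-C": 12.0, "WSJ-A": 8.0, "WSJ-D": 2.0}

merged = merge_runs([boolean_run, pnorm_run, vector_run])
for doc, score in merged:
    print(f"{doc}\t{score:.3f}")
```

Normalizing before summing keeps a run with a large raw score range (here the hypothetical vector run) from dominating the others. The optional weights are one place where limited training information could enter.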
In Phase 2, after the TREC meeting, we experimented with all five collections on the first
CD-ROM. Based on our work with the WSJ, however, we restricted our investigation to five of
the original eight types of runs. We also explored the use of limited training information in
merging the results. Work is continuing at Virginia Tech to incorporate the results of our many runs
into merged selections, with the aim of improving performance.
Subsequent sections of this paper describe all of these activities in greater detail.
2 Indexing and Data Structures
This section outlines the indexing done with the document collection provided by NIST. Because
available disk space was limited, only Disc 1 was used during the experimental runs. The
documents were indexed on a DECstation 5000/25 using an enhanced version of the 1985 release