NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Combination of Multiple Searches

Edward A. Fox and Joseph A. Shaw
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106

Abstract

The TREC-2 project at Virginia Tech focused on methods for combining the evidence from multiple retrieval runs to improve retrieval performance over any single retrieval method. This paper describes one such method that has been shown to increase performance by combining the similarity values from five different retrieval runs using both vector space and P-norm extended boolean retrieval methods.

1 Overview

The primary focus of our experiments at Virginia Tech was on methods for combining the results from various divergent search schemes and document collections. We performed both routing and ad-hoc retrieval experiments on the provided test collections. The results from both vector and P-norm queries were considered in determining the probability of relevance for each document in an individual collection. The results for each collection were then merged to create a single final set of documents to be presented to the user.

2 Index Creation

This section outlines the indexing done with the document collection provided by NIST. Each of the individual collections was indexed separately as a document vector file; limitations in disk space prohibited the use of inverted files and the creation of a single combined document vector file. All processing was performed on a DECstation 5000/25 with 40 MB of RAM using the 1985 release of the SMART Information Retrieval System [2], with enhancements from previous experiments as well as a new modification for our TREC-2 experiments.

The index files were created from the source text via the following process. First, the source document text provided by NIST was passed through a preparser to convert the SGML-like format into the proper format for the 1985 version of SMART. The extraneous sections of the documents were filtered out at this point. The TEXT sections of the documents, as well as the various HEADLINE, TITLE, SUMMARY, and ABSTRACT sections of the collections, were indexed; all other sections were ignored. The subsections of the TEXT fields, where they existed, were treated as part of the TEXT field, with the subsection delimiters removed.

The resulting filtered text was tokenized, stop words were deleted using the standard 418-word stop list provided with SMART, and the remaining non-noise words were entered into the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file was created during indexing which contains, for each document, its unique ID and a vector of term IDs and term weights. The initially recorded weights can be changed under one of several schemes after the indexing is complete. The various SMART weighting schemes referred to within this paper are summarized in Table 1.

Table 1: SMART weighting schemes used for TREC-2.

  SMART label   term weight
  ann           0.5 + 0.5 * (tf / max_tf)
  bnn           1
  mnn           tf / max_tf
  atn           0.5 + (tf / (2 * max_tf)) * log(num_docs / coll_freq)
  nnn           tf
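To make Table 1 concrete, the short sketch below computes each weight for a single (term, document) pair. It is an illustration only, not code from the SMART system; the function and parameter names (smart_weight, tf, max_tf, num_docs, coll_freq) simply mirror the quantities in the table, and the atn row follows the reconstruction of the garbled original given above.

    import math

    def smart_weight(scheme, tf, max_tf, num_docs, coll_freq):
        """Term weight for one (term, document) pair under the Table 1 schemes.

        tf        -- raw frequency of the term in the document
        max_tf    -- largest term frequency in that document
        num_docs  -- number of documents in the collection
        coll_freq -- number of documents containing the term
        """
        if scheme == "ann":      # augmented tf, no idf, no normalization
            return 0.5 + 0.5 * tf / max_tf
        if scheme == "bnn":      # binary: every indexed term weighs 1
            return 1.0
        if scheme == "mnn":      # tf scaled by the document's maximum tf
            return tf / max_tf
        if scheme == "atn":      # augmented tf combined with an idf factor
            return 0.5 + (tf / (2.0 * max_tf)) * math.log(num_docs / coll_freq)
        if scheme == "nnn":      # raw term frequency
            return float(tf)
        raise ValueError("unknown weighting scheme: %s" % scheme)

    # Example: a term appearing 3 times in a document whose most frequent
    # term appears 10 times, in a 1000-document collection where 50
    # documents contain the term.
    for s in ("ann", "bnn", "mnn", "atn", "nnn"):
        print(s, round(smart_weight(s, 3, 10, 1000, 50), 3))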
The dictionary size for each collection was approximately 16 MB, while the document vector files ranged from 31 MB to 124 MB (see Table 2).
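The indexing process described in this section, in which the filtered source is tokenized, stop words are removed, each surviving term receives a unique dictionary ID, and each document is stored as a vector of (term ID, weight) pairs, can be sketched roughly as follows. This is a minimal illustration of the data structures only, not SMART code; the tokenizer and the tiny stop list are stand-ins for SMART's actual preparser and 418-word stop list.

    import re
    from collections import Counter

    # Stand-in stop list; SMART's standard list has 418 words.
    STOP_WORDS = {"the", "of", "and", "a", "to", "in", "as"}

    def tokenize(text):
        """Lower-case alphabetic tokens with stop words removed."""
        return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

    def build_index(docs):
        """Return (term_dictionary, document_vectors).

        term_dictionary maps term -> (term ID, collection frequency);
        document_vectors maps doc ID -> sorted list of (term ID, raw tf),
        i.e. the initially recorded weights, which may be re-weighted later.
        """
        term_ids, coll_freq, vectors = {}, Counter(), {}
        for doc_id, text in docs.items():
            tf = Counter(tokenize(text))
            for term in tf:
                if term not in term_ids:
                    term_ids[term] = len(term_ids)   # unique identification number
                coll_freq[term] += 1                 # documents containing the term
            vectors[doc_id] = sorted((term_ids[t], f) for t, f in tf.items())
        dictionary = {t: (i, coll_freq[t]) for t, i in term_ids.items()}
        return dictionary, vectors

    # Tiny example collection.
    docs = {"WSJ0001": "Stocks rose in early trading.",
            "WSJ0002": "Bond prices fell as stocks rose."}
    dictionary, vectors = build_index(docs)
    print(vectors["WSJ0002"])   # -> [(0, 1), (1, 1), (4, 1), (5, 1), (6, 1)]

The raw term frequencies recorded here play the role of the initially recorded weights, which would then be replaced by one of the Table 1 schemes after indexing.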