NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). Donna K. Harman, National Institute of Standards and Technology.

Chapter: Combining Evidence from Multiple Searches. E. Fox, M. Koushik, J. Shaw, R. Modlin, D. Rao.

* Disk Space Limitations: Because the promised award from DARPA arrived late, we had trouble acquiring the required disk space in time for the submission of results. A great deal of time was wasted making partial runs on a hodgepodge of computers, with NFS mounting of remote files slowing processing by almost an order of magnitude. Ultimately we started over when told that work on the WSJ collection would suffice. This forced us to submit only a subset of results in Phase 1, and to complete work only on Disc 1 during Phase 2. With more disk space we could have used inverted files to index the data, which would have made processing much faster and would have allowed real-time interactive searching, in turn speeding up and improving the quality of our relevance judgment operation. Also, with more disk space, we could have used an RS/6000 to get runs done more quickly, assuming SMART could be ported and made fully operational on that platform.

* Merging of Retrieval Runs: Ideally, an effective method for predicting relevance, such as the Decision Tree or CEO model, should be used to take the results from the various runs and determine the relevant documents based upon the training set and the relevance judgments. However, in Phase 1 we did not have enough time to implement these properly, so we settled on a less effective approach: using the ranks of the documents from each of the runs. The top N documents from each run were included in the result set, ordered by rank. The value of N was determined by the number of document-number repetitions across the various runs; its average value was near 40. This method of combining the results from the various runs was flawed.
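The rank-based merge described above can be sketched as follows. This is a minimal illustration, not the authors' SMART implementation: the run data and the fixed cutoff N are hypothetical, and the source's rule for deriving N from document-number repetitions is not reproduced here.

```python
def rank_merge(runs, n):
    """Merge ranked runs by taking the top-n documents of each run and
    ordering the merged list by the best (lowest) rank a document
    achieved in any run.  `runs` is a list of ranked document-id lists."""
    best_rank = {}
    for run in runs:
        for rank, doc in enumerate(run[:n], start=1):
            if doc not in best_rank or rank < best_rank[doc]:
                best_rank[doc] = rank
    # Order merged documents by best rank; break ties by document id.
    return sorted(best_rank, key=lambda d: (best_rank[d], d))

# Hypothetical runs: each is a list of document ids in rank order.
run_a = ["d3", "d1", "d7", "d2"]
run_b = ["d1", "d9", "d3", "d8"]
print(rank_merge([run_a, run_b], n=3))  # → ['d1', 'd3', 'd9', 'd7']
```

As the text notes, this scheme treats every run as equally trustworthy: a document's final position depends only on its best rank in any one run, regardless of that run's quality.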
Using the top N documents from each run assumes that all of the runs were of equal quality, which is usually not the case. Each poor run contributes many non-relevant documents to the final list, while a high-quality run may have many relevant documents miss the cutoff of N. The similarity values should be used to determine the top-ranked documents to pull from each run, but the combination of runs with various incompatible similarity measures made this a non-trivial task. In Phase 2, we used additional training data, namely the recall-precision averages from the evaluation of each run whose results were to be merged. This approach, too, is flawed, since the averages reflect general trends, while query-specific trends would be much more appropriate to use.

* Faulty Queries: No feedback modification could be performed to improve retrieval performance. The Boolean queries proved to be too general in many cases. The p-norm runs were made with the Boolean queries; separate p-norm queries (i.e., ones that made use of user-assigned p-values and query weights) would have improved retrieval effectiveness. The vector queries were generally good, but also had some faults. In particular, shorter vector queries might have led to better Similarity Merge behavior by decreasing the likelihood of spurious matches in sub-collections with few relevant documents. Also, the NOT clauses from the topic descriptions were included in the vector queries, which may have contributed to the retrieval of non-relevant documents. The net effect of the faulty queries and the inadequate combination method was lower retrieval quality than desired.
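One standard way to make incompatible similarity measures comparable, sketched below, is to min-max normalize each run's scores before merging; per-run weights (for example, the recall-precision averages used as training data in Phase 2) can then bias the merge toward better runs. This is an illustrative sketch under those assumptions, not the method actually used in the experiments; the run data are hypothetical.

```python
def normalize(scores):
    """Min-max normalize a {doc: similarity} map to [0, 1] so that
    runs with incompatible similarity scales become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def score_merge(runs, weights=None):
    """Merge runs by summed, optionally weighted, normalized scores.
    `runs` is a list of {doc: similarity} maps; `weights` could carry
    per-run effectiveness estimates such as recall-precision averages."""
    weights = weights or [1.0] * len(runs)
    merged = {}
    for w, run in zip(weights, runs):
        for doc, s in normalize(run).items():
            merged[doc] = merged.get(doc, 0.0) + w * s
    # Highest combined score first; break ties by document id.
    return sorted(merged, key=lambda d: (-merged[d], d))

# Hypothetical runs on incompatible scales (cosine-like vs. raw counts).
run_a = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
run_b = {"d2": 40.0, "d4": 35.0, "d1": 5.0}
print(score_merge([run_a, run_b]))  # → ['d2', 'd1', 'd4', 'd3']
```

Note that uniform per-run weights reproduce the "all runs equal" assumption criticized above, while query-specific weights, had they been available, could be supplied through the same parameter.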