NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). Chapter: Combining Evidence from Multiple Searches. E. Fox, M. Koushik, J. Shaw, R. Modlin, D. Rao. National Institute of Standards and Technology, Donna K. Harman.

Table 6: Base Runs + Similarity Merge + R-P Merge

Run/Collection   AP      DOE     FR      WSJ     ZF      Sim-Merge
cosine.atn       0.1138  0.0543  0.0259  0.2740  0.813   0.1149
cosine.nnn       0.1890  0.0330  0.0504  0.2184  0.0946  0.1513
inner.atn        0.1241  0.0609  0.0405  0.3224  0.0888  0.1717
inner.nnn        0.1478  0.0252  0.0108  0.1329  0.0101  0.0075
pnorml.0         0.3006  0.0876  0.0727  0.3085  0.1448  0.1831
R-P Merge        0.2268  0.0554  0.0521  0.2555  0.1003  0.1523

2. Associate with each run an estimate of the probability that the top item on the stack is relevant. Initially, this value is the precision value at recall 0.0 from our evaluation.

3. To draw the next item for the merged result, identify the stack with the highest estimated probability of relevance, pop the top of the stack, and update the probability estimate.

4. To update the probability estimate, use interpolation based on the number of popped items (i.e., the number "retrieved") and the rest of the recall-precision results.

5. Continue the process, starting again with step 3, as long as fewer than 200 items have been drawn.

The results of applying this "R-P Merge" algorithm are given in the last line of Table 6. Note that the first five columns of that line show the results of merging for a particular collection, and the final value reflects Similarity Merge followed by R-P Merge. From Table 6 we see that the R-P Merge method does not yield results as good as the best individual run; in particular, we could simply use the pnorml.0 results uniformly and do better than with merging. Refinements of the above algorithm, possibly yielding more accurate estimates in step 4, will be investigated in 1993.
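Steps 2 through 5 above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Run` and `rp_merge` names are ours, and in particular the recall proxy used in `prob_relevant` (items drawn divided by the length of the run's result list) is an assumption, since the paper only states that the estimate is interpolated from the number retrieved and the recall-precision results.

```python
from dataclasses import dataclass


def interp_precision(rp_curve, recall):
    """Linearly interpolate precision at `recall` from (recall, precision) points."""
    pts = sorted(rp_curve)
    if recall <= pts[0][0]:
        return pts[0][1]
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        if recall <= r1:
            t = (recall - r0) / (r1 - r0)
            return p0 + t * (p1 - p0)
    return pts[-1][1]


@dataclass
class Run:
    name: str
    ranked_docs: list   # document ids, best first: the "stack" of step 2
    rp_curve: list      # (recall, precision) points from evaluation
    popped: int = 0     # number of items drawn so far ("retrieved")

    def __post_init__(self):
        self.depth = len(self.ranked_docs)

    def prob_relevant(self):
        """Step 4: estimate P(top item relevant) by interpolating the R-P curve.

        Assumed recall proxy: fraction of this run's list already drawn.
        At popped == 0 this yields precision at recall 0.0, as in step 2.
        """
        return interp_precision(self.rp_curve, self.popped / self.depth)


def rp_merge(runs, limit=200):
    """Steps 3 and 5: repeatedly pop from the run with the highest estimate."""
    merged = []
    while len(merged) < limit and any(r.ranked_docs for r in runs):
        best = max((r for r in runs if r.ranked_docs),
                   key=lambda r: r.prob_relevant())
        merged.append(best.ranked_docs.pop(0))
        best.popped += 1
    return merged
```

For example, a run whose precision decays slowly keeps winning draws until its interpolated estimate falls below a competitor's, at which point the merge switches stacks.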
Other studies will consider whether recall-precision data for each query could be used in a similar training situation, for subsequent testing on Disc 2.

7 Evaluation

7.1 Software engineering

We began with the 1985 version of SMART and have enhanced it. For a long period we tried to use the new version of SMART on an RS/6000, but found its use of disk space to be excessive; since we could not obtain reliable results, we went back to the older version. Extensive software development has been carried out since May, including C programs and Unix shell scripts to partially automate indexing, retrieval, relevance judgment making, merging, evaluation, and tabulation of results.

7.2 Problems and Failure Analysis

Problems encountered in the project were partially identified. A failure analysis was performed on a subset of documents that failed to be included in our result set in Phase 1. Observations regarding our merging methods came from further studies in Phase 2. The following observations summarize our findings.