NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Combining Evidence from Multiple Searches
E. Fox
M. Koushik
J. Shaw
R. Modlin
D. Rao
National Institute of Standards and Technology
Donna K. Harman
Table 6: Base Runs + Similarity Merge + R-P Merge

Run/Collection    AP       DOE      FR       WSJ      ZF       Sim-Merge
cosine.atn        0.1138   0.0543   0.0259   0.2740   0.0813   0.1149
cosine.nnn        0.1890   0.0330   0.0504   0.2184   0.0946   0.1513
inner.atn         0.1241   0.0609   0.0405   0.3224   0.0888   0.1717
inner.nnn         0.1478   0.0252   0.0108   0.1329   0.0101   0.0075
pnorm1.0          0.3006   0.0876   0.0727   0.3085   0.1448   0.1831
R-P Merge         0.2268   0.0554   0.0521   0.2555   0.1003   0.1523
2. Associate with each run an estimate of the probability that the top item on the stack is
relevant. Initially, this value is the precision value at recall 0.0 from our evaluation.
3. To draw the next item for the merged result, identify the stack with the highest estimated
probability of relevance, pop the top of the stack, and update the probability estimate.
4. To update the probability estimate, use interpolation based on the number of popped items
(i.e., the number "retrieved") and the rest of the recall-precision results.
5. Continue the process, starting again with step 3, as long as fewer than 200 items have been
drawn.
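The steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function and variable names are hypothetical, and the paper does not specify exactly how the interpolation in step 4 is computed, so this sketch assumes the number of popped items divided by the number of relevant documents serves as a crude recall proxy that is then looked up on the run's recall-precision curve.

```python
def interpolate(rp_curve, recall):
    """Interpolated precision: the maximum precision observed at any
    recall level >= the given recall (a common convention; assumption)."""
    pts = [p for r, p in rp_curve if r >= recall]
    return max(pts) if pts else 0.0

def rp_merge(runs, total_relevant, limit=200):
    """Merge ranked runs by repeatedly drawing from the run whose top
    item has the highest estimated probability of relevance.

    runs: {name: (ranked_docs, rp_curve)}, where rp_curve is a list of
    (recall, precision) points from a prior evaluation (step 2)."""
    stacks = {name: list(docs) for name, (docs, _) in runs.items()}
    popped = {name: 0 for name in runs}
    # Step 2: initial estimate is the precision at recall 0.0.
    est = {name: interpolate(curve, 0.0) for name, (_, curve) in runs.items()}
    merged = []
    while len(merged) < limit:
        live = [n for n in stacks if stacks[n]]
        if not live:
            break
        # Step 3: pop from the stack with the highest estimate.
        best = max(live, key=lambda n: est[n])
        merged.append(stacks[best].pop(0))
        popped[best] += 1
        # Step 4: update the estimate by interpolating on the run's
        # recall-precision curve, using popped/total_relevant as a
        # recall proxy (an assumption; the paper leaves this open).
        recall = min(1.0, popped[best] / total_relevant)
        est[best] = interpolate(runs[best][1], recall)
    return merged
```

Drawing stops once 200 items have been drawn (step 5) or all stacks are empty.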
The results of applying this "R-P Merge" algorithm are given in the last line of Table 6. Note
that the first five columns of that line show the result of merging for a particular collection, and
the final value reflects Similarity Merge followed by R-P Merge.
From Table 6 we see that the R-P Merge method does not yield results as good as those of
the best individual run. In particular, simply using the pnorm1.0 results uniformly would do
better than merging.
Further improvements to the above algorithm, possibly yielding more accurate estimates in step
4, will be investigated in 1993. Other studies will consider whether recall-precision data for each
query could be used in a similar training situation, for subsequent testing on Disc 2.
7 Evaluation
7.1 Software engineering
We began with the 1985 version of SMART and have enhanced it. For a long period we tried
to use the new version of SMART on an RS/6000, but found its use of disk space to be excessive.
Since we could not obtain reliable results, we went back to the older version.
We have undertaken extensive software development since May. This included writing C programs
and Unix shell scripts to partially automate indexing, retrieval, relevance judging, merging,
evaluation, and tabulation of results.
7.2 Problems and Failure Analysis
Problems encountered in the project were only partially identified. A failure analysis was
performed on a subset of documents that should have appeared in our Phase 1 result set but did
not. Further studies in Phase 2 yielded additional observations regarding our merging methods.
The following observations summarize our findings.