SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Combining Evidence from Multiple Searches
chapter
E. Fox
M. Koushik
J. Shaw
R. Modlin
D. Rao
National Institute of Standards and Technology
Donna K. Harman
* Disk Space Limitations:
We had problems acquiring the required disk space in time for the submission of results, due
to the lateness of the promised award from DARPA. A great deal of time was wasted making
partial runs on a hodgepodge of computers, with NFS mounting of remote files slowing
processing by almost an order of magnitude. Ultimately, we started over when told that work
on the WSJ collection would suffice.
This forced us to submit a subset of results in Phase 1, and to complete work only on Disc 1
during Phase 2. With more disk space we could have used inverted files for indexing the data,
which would have made processing much faster. It would also have allowed real-time interactive
searching, which would have sped up our relevance judgement operation and improved its
quality. Also, with more disk space, we could have used an RS/6000 to get runs done more
quickly, assuming SMART could be ported and made fully operational on that platform.
* Merging of Retrieval Runs:
Ideally, an effective method of predicting relevance, such as the Decision Tree or CEO model,
should be used to take the results from the various runs and determine the relevant documents
based upon the training set and the relevance judgements. However, in Phase 1 we did not have
enough time to implement these properly, so we settled on a less effective approach: using
the ranks of the documents from each of the runs. The top N documents from each run were
included in the result set, ordered by rank. The value of N was determined by the number of
document-number repetitions across the various runs; the average value of N was near 40.
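The rank-based merge described above can be sketched as follows. This is a minimal illustration, not the authors' actual code; the run contents and the value of N here are hypothetical:

```python
# Sketch of a rank-based merge: take the top n documents from each run,
# then order the merged set by the best (lowest) rank any run gave a document.
# Illustrative only; not the SMART implementation described in the text.

def rank_merge(runs, n):
    """runs: lists of document IDs, each ordered best-first."""
    best_rank = {}
    for run in runs:
        for rank, doc in enumerate(run[:n], start=1):
            if doc not in best_rank or rank < best_rank[doc]:
                best_rank[doc] = rank
    # Sort merged documents by best rank; ties broken by document ID.
    return sorted(best_rank, key=lambda d: (best_rank[d], d))

# Hypothetical runs, e.g. a vector run and a Boolean run on the same collection.
run_a = ["d3", "d1", "d7", "d2"]
run_b = ["d1", "d5", "d3", "d9"]
merged = rank_merge([run_a, run_b], n=3)
print(merged)  # → ['d1', 'd3', 'd5', 'd7']
```

Note that this treats every run as equally trustworthy, which is exactly the weakness the next paragraph discusses.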
This method of combining the results from the various runs was flawed. Taking the top N
documents from each run assumes that the runs were of equal quality, which is usually
not the case. Each poor run would contribute many non-relevant documents to the final list,
while a quality run could have many relevant documents miss the cutoff of N. The similarity
values should be used to determine the top-ranked documents to pull from each run, but the
combination of runs with various incompatible similarity measures made this a non-trivial
task.
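One standard way to reconcile incompatible similarity scales is to normalize each run's scores into a common range before merging. The sketch below uses per-run min-max normalization; it is an assumption for illustration, not a step the text says was implemented, and the score values are invented:

```python
# Sketch: min-max normalization to map each run's raw similarity values
# onto a common [0, 1] scale before merging. Hypothetical scores only.

def normalize(scores):
    """scores: dict mapping document ID to raw similarity for one run."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # degenerate run: all scores identical
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

vector_run = {"d1": 0.82, "d2": 0.41, "d3": 0.10}  # e.g. cosine similarities
pnorm_run = {"d1": 5.6, "d4": 3.1, "d3": 1.2}      # e.g. p-norm values
norm_v, norm_p = normalize(vector_run), normalize(pnorm_run)
print(norm_v["d1"], norm_p["d1"])  # → 1.0 1.0 (each run's maximum)
```

After normalization, the per-document scores from different runs can be summed or averaged on a comparable footing.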
In Phase 2, we used additional training data, namely the recall-precision average values from
the evaluation of each run whose results were to be merged. This approach, too, is flawed,
since these averages reflect general trends across all queries, while query-specific
performance would be much more appropriate to use.
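The Phase 2 scheme can be sketched as weighting each run's contribution by its overall evaluation score. The weights, runs, and scoring rule below are assumptions for illustration; note that the same run-level weight applies to every query, which is precisely the flaw identified above:

```python
# Sketch: merge runs with each run weighted by a run-level quality score
# (e.g., its recall-precision average). Illustrative; not the authors' code.

def weighted_merge(runs, weights, n):
    """runs: ranked document lists; weights: one quality score per run.
    Each document earns weight * (n - rank + 1) from each run, summed."""
    scores = {}
    for run, w in zip(runs, weights):
        for rank, doc in enumerate(run[:n], start=1):
            scores[doc] = scores.get(doc, 0.0) + w * (n - rank + 1)
    # Sort by descending score; ties broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical runs with hypothetical recall-precision averages as weights.
run_a = ["d3", "d1", "d7"]  # stronger run, weight 0.35
run_b = ["d5", "d1", "d3"]  # weaker run, weight 0.15
merged = weighted_merge([run_a, run_b], [0.35, 0.15], n=3)
print(merged)  # → ['d3', 'd1', 'd5', 'd7']
```

A per-query version would replace the single weight per run with a weight per (run, query) pair, which is the refinement the text argues for.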
* Faulty Queries:
No feedback modification could be performed to improve retrieval performance. The Boolean
queries proved to be too general in many cases. The p-norm runs were made with the
Boolean queries; separate p-norm queries (i.e., ones that made use of user-assigned p-values
and query weights) would have improved retrieval effectiveness. The vector queries were
generally good, but also had some faults. In particular, shorter vector queries might have
led to better Similarity Merge behavior by decreasing the likelihood of spurious matches in
sub-collections with few relevant documents. Also, the NOT clauses from the topic
descriptions were included in the vector queries, which may have contributed to the retrieval
of non-relevant documents. The net effect of faulty queries and an inadequate combination
method was lower retrieval quality than desired.