NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-II Routing Experiments with the TRW/Paracel Fast Data Finder
M. Mettler
5.0 Analysis of Results
The results from our two TREC runs (Table III) are summarized below. The proximity
queries (TRW1) scored at or above the median on 28 topics (including three topics which
achieved the best score) and below the median on 22 topics. While not bad, our proximity
queries did not perform as well as we had hoped. Where the proximity queries did poorly,
we attribute this primarily to poor term selection. One such case was Topic 65, Information
Retrieval Systems. We made two errors: first, due to an oversight, one of the manually
entered query terms was overly broad; second, the query author considered database
systems to be "information retrieval systems". We feel that this is a fault of the query
formulation, not of the assessments. Had we checked the training assessments, we would
have eliminated those terms from our query. Any system which relies solely on the topic
statement will run afoul of this problem; systems which make use of user-supplied
relevance information will achieve better performance. Our results again demonstrate the
inherent problems of basically Boolean query formulations. Our efforts at using
proximity to soften the Boolean constraints were not sufficient to overcome this weakness
(the idea is sketched after Table III below).
Table III. Summary of routing results.

Run     High   Above Med   Median   Below Med   Low   11-pt avg
TRW1      3       19          6         22       0     0.2525
TRW2      5       27          5         13       0     0.3459
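To make the notion of softening a Boolean query with proximity concrete, the following is a minimal Python sketch of window-based scoring. It is purely illustrative: it is not the FDF query language and not our actual TRW1 query logic, and the function name, parameters, and window size are invented for the example.

    # Illustrative sketch only: not the FDF query language or our TRW1 queries,
    # just plain Python showing the idea of replacing a strict Boolean AND
    # over the whole document with a proximity window.
    def proximity_score(doc_tokens, query_terms, window=10):
        """Fraction of query terms that co-occur within `window` tokens of some
        query-term occurrence, rather than requiring every term to appear
        anywhere in the document (strict AND)."""
        positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                     for t in query_terms}
        best = 0
        for pos_list in positions.values():
            for p in pos_list:
                # How many distinct query terms appear within the window around p?
                hits = sum(1 for ps in positions.values()
                           if any(abs(q - p) <= window for q in ps))
                best = max(best, hits)
        return best / len(query_terms)   # 1.0 = all terms near one another, 0.0 = none present

    # Example: all three terms occur close together, so the score is 1.0.
    doc = "modern information retrieval systems rank documents by score".split()
    print(proximity_score(doc, ["information", "retrieval", "systems"]))

A document containing only some of the terms, or containing them far apart, receives a partial score rather than being rejected outright, which is the softening effect we were after.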
The statistical queries (TRW2) did much better, scoring at or above the median on 37 of the
topics. The 11-pt average and total relevant documents retrieved figures were excellent and
close to those of the best academic groups. The adaptations required to run the statistical
queries on the FDF hardware evidently did not hurt performance. We again observed that
the dominant factor in achieving good performance is proper term selection. The details of
the term weight calculations did not seem to make much difference except to influence
which terms were selected. We tried a number of different schemes for generating term
weights, including using various statistical parameters, log-weighted coefficients, and
converting terms present in all sample documents to Boolean ANDs in the query. For a
given set of terms, we did not find much difference in performance among these schemes.
The scheme used for TRW2 was one of the simpler ones we tried.
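As an illustration of the simpler end of that spectrum, the sketch below builds log-weighted coefficients from relevant training documents and flags terms occurring in every relevant document as candidates for Boolean ANDs. The exact formulas are not given here, so the specific weight definition in the sketch is an assumption for illustration, not the scheme actually used for TRW2.

    # Hypothetical log-weighted scheme of the general kind described above; the
    # weight definition (log ratio of a term's rate in relevant training docs to
    # its smoothed rate in the collection) is an assumption for illustration only.
    import math
    from collections import Counter

    def build_weights(relevant_docs, collection_docs):
        """relevant_docs / collection_docs: lists of token lists from training data.
        Returns (weights, required): per-term log weights, plus terms present in
        every relevant document, which would become Boolean ANDs in the query."""
        rel_df = Counter()                   # document frequency among relevant docs
        for doc in relevant_docs:
            rel_df.update(set(doc))
        coll_df = Counter()                  # document frequency in the whole collection
        for doc in collection_docs:
            coll_df.update(set(doc))

        n_rel, n_coll = len(relevant_docs), len(collection_docs)
        weights, required = {}, []
        for term, df in rel_df.items():
            p_rel = df / n_rel
            p_coll = (coll_df[term] + 1) / (n_coll + 1)   # smoothed collection rate
            weights[term] = math.log(p_rel / p_coll)      # higher = more discriminating
            if df == n_rel:                               # term in every relevant doc
                required.append(term)                     # candidate Boolean AND
        return weights, required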
We were also interested in evaluating the use of phrases and special features as additional
terms in our statistical queries. They seemed to help, but not dramatically, which was a
disappointment. Looking at the results topic by topic, however, we observed a lot of
variation: for some topics, the addition of a key phrase or special feature helped a great
deal. This indicates that the use of phrases and special features has promise for improving
performance, but that we have not yet learned how and when to employ them. For
example, our term weighting scheme this year did not account for term interdependence.
Particularly when we start mixing single-word terms with phrases and special features that
contain those same terms, the algorithm could be improved by explicitly accounting for
this redundancy, as sketched below.
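One possible form of such an adjustment: when a phrase term matches, discount the contribution of the single-word terms it contains so the same evidence is not credited twice. This is not the algorithm we used; the representation of phrases and the discount factor below are illustrative assumptions.

    # One possible way to account for the redundancy described above: when a phrase
    # matches, discount the weights of single-word terms it contains so the same
    # evidence is not counted twice. The representation (phrases as tuples of words)
    # and the discount factor are illustrative assumptions, not our actual algorithm.
    def score(matched_terms, weights, discount=0.5):
        """matched_terms: matched query terms; a phrase is a tuple of words, a
        single-word term is a string. weights: term -> weight."""
        covered = {w for t in matched_terms if isinstance(t, tuple) for w in t}
        total = 0.0
        for t in matched_terms:
            w = weights.get(t, 0.0)
            if isinstance(t, str) and t in covered:
                w *= discount            # word already credited via a matching phrase
            total += w
        return total

    weights = {("information", "retrieval"): 2.0, "information": 0.8, "retrieval": 0.9}
    print(score({("information", "retrieval"), "information", "retrieval"}, weights))
    # 2.85 with the discount, versus 3.70 if the single words were fully double-counted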