NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
Donna K. Harman
National Institute of Standards and Technology
"hit" versus the probability of false alarm, or ale probability of a "false drop". The x axis plots the probability
of false alarm, calculated as follows
Probability of false alarm = (number of nonrelevant items retrieved) / (total number of nonrelevant items in collection)
The y axis plots the probability of detection, calculated as

Probability of detection = (number of relevant items retrieved) / (total number of relevant items in collection)
Note that the probability of detection is the same as recall, and the probability of false alarm is the same as fallout, an older measure in information retrieval (Salton & McGill 1983). These measures are for a single topic,
but averages can be computed similarly to the recall-level averages by using probability of detection at fixed
false alarm rates. The tables in Appendix A show both this average ROC curve and the same curve plotted on
probability scales (Swets 1969).
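The two per-topic measures defined above can be sketched as a short function. This is an illustrative sketch only: the function name and the set-based inputs are assumptions for the example, not part of the TREC evaluation software.

```python
def detection_and_false_alarm(retrieved, relevant, collection_size):
    """Compute the per-topic ROC coordinates defined above.

    retrieved:       set of document ids returned by the system
    relevant:        set of document ids judged relevant for the topic
    collection_size: total number of documents in the collection
    """
    relevant_retrieved = len(retrieved & relevant)
    nonrelevant_retrieved = len(retrieved) - relevant_retrieved
    total_nonrelevant = collection_size - len(relevant)

    # Probability of detection = recall
    p_detection = relevant_retrieved / len(relevant)
    # Probability of false alarm = fallout
    p_false_alarm = nonrelevant_retrieved / total_nonrelevant
    return p_detection, p_false_alarm
```

For example, retrieving documents {1, 2, 3, 4} when {1, 2, 5} are relevant in a 10-document collection gives a detection probability of 2/3 and a false-alarm probability of 2/7.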
5. Preliminary Results
5.1 Introduction
The results of the TREC-1 conference should be viewed only as a preliminary baseline for what can be
expected from systems working with large test collections. There are several reasons for this. First, the dead-
lines for results were very tight, and most groups had minimal time for experiments. As discussed earlier, the
huge scale-up in the size of the document collection required major work from all groups in rebuilding their sys-
tems. Much of this work was simply a system engineering task: finding reasonable data structures to use, get-
ting indexing routines to be efficient enough to finish indexing the data, finding enough storage to handle the
large inverted files and other structures, etc.
The second reason these results are preliminary is that groups were working blindly as to what constitutes a
relevant document. There were no reliable relevance judgments for training, and the use of the long topics was
completely new. This means that results were heavily influenced by an almost random selection of what parts
of the topic to use. Groups also often had to make primitive adjustments to basic algorithms in order to get results, with little evidence of how well these adjustments were working. The large scale of the whole evaluation precluded any tuning without some relevance judgments, and the relevance judgments that were provided
were generally sparse and sometimes inaccurate. These problems particularly affected those systems that needed
training for routing.
Many of the papers in the proceedings show some new results from work done in the short amount of time
between the conference and the due date of the papers (less than 2 months). Some of the improvements are
very significant, and the improvements seen in the TIPSTER results (where the results are a second try at this task) are large. It can be expected that the results seen at the second TREC conference will be much better, and
also more indicative of how well a method works.
Because these results are preliminary, they should be compared very carefully. Some very broad conclusions
can be drawn, but no methods should be conclusively judged inferior or superior at this point.
5.2 Adhoc Results
The adhoc evaluation used new topics (51-100) against the two disks of documents (Dl + D2). There were
33 sets of results for adhoc evaluation in TREC, with 20 of them based on runs for the full data set. Of these,
13 used automatic construction of queries, 6 used manual construction, and 1 used feedback. Figure 5 shows the
recall/precision curve for the three TREC-1 runs with the highest 11-point averages using automatic construction
of queries. These curves were all based on the use of the Cornell SMART system, but with important varia-
tions. The "fuhrpl" results came from using the training data to find parameter weights (see Fuhr & Buckley
paper), the "crnlpl" results came from doing local and global term weighting without training data (see Buckley,
Salton & Allan paper), and the "siemsl" results came from using term expansion with terms from "WordNet"
(see Voorhees & Hou paper).
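The ranking criterion used here, the 11-point average, can be sketched for a single topic as follows. This is a minimal illustration of standard interpolated averaging over the recall levels 0.0, 0.1, ..., 1.0, assuming interpolated precision at recall r is the maximum precision achieved at any recall level >= r; the function name and inputs are hypothetical, not taken from the TREC evaluation code.

```python
def eleven_point_average(ranked, relevant):
    """11-point interpolated average precision for one topic.

    ranked:   list of document ids in system rank order
    relevant: set of document ids judged relevant for the topic
    """
    # Record (recall, precision) after each relevant document retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolate precision at the 11 standard recall levels.
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)
```

For example, with ranking [1, 2, 3, 4] and relevant set {1, 3}, the relevant documents appear at ranks 1 and 3, giving interpolated precision 1.0 at recall levels 0.0 through 0.5 and 2/3 at levels 0.6 through 1.0.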