NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Gey, A. Chen
National Institute of Standards and Technology
Donna K. Harman
interest. They are
For AP:   -7.21 + 0.40 X1 + 0.04 X2 + 0.88 X3 - 0.10 X4 + 1.09 X5 + 0.25 X6
For DOE:  -7.51 + 0.44 X1 + 0.05 X2 + 1.18 X3 - 0.12 X4 + 0.94 X5 + 0.24 X6
For FR:   -6.83 + 0.44 X1 + 0.04 X2 + 0.57 X3 - 0.06 X4 + 1.13 X5 + 0.24 X6
For ZIFF: -6.95 + 0.38 X1 + 0.04 X2 + 0.68 X3 - 0.07 X4 + 1.03 X5 + 0.21 X6
As an illustration of what the transformation of coefficients has accomplished, one sees
that the coefficient for X3, the absolute frequency of the match term in the document, is
largest for the DOE collection and smallest for the FR collection. This serves to
compensate for the fact that the average document length is smallest in DOE and largest in FR.
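The per-collection first-stage equations above amount to a single linear form with a collection-specific coefficient vector. A minimal sketch of how such an equation would be evaluated follows; the function name and the representation of the features X1..X6 as a tuple are illustrative choices, not part of the original system (the paper defines the Xi earlier in the text).

```python
# Coefficients of the first-stage log-odds equations, one row per
# TIPSTER collection, taken from the equations above:
# (intercept, coefficient of X1, ..., coefficient of X6).
COEFFS = {
    "AP":   (-7.21, 0.40, 0.04, 0.88, -0.10, 1.09, 0.25),
    "DOE":  (-7.51, 0.44, 0.05, 1.18, -0.12, 0.94, 0.24),
    "FR":   (-6.83, 0.44, 0.04, 0.57, -0.06, 1.13, 0.24),
    "ZIFF": (-6.95, 0.38, 0.04, 0.68, -0.07, 1.03, 0.21),
}

def first_stage_log_odds(collection, x):
    """Evaluate the first-stage linear form for one match term.

    `x` is the tuple of feature values (X1, ..., X6); the result is the
    intercept plus the weighted sum of the features.
    """
    intercept, *weights = COEFFS[collection]
    return intercept + sum(c * xi for c, xi in zip(weights, x))
```

Note how the larger X3 weight for DOE (1.18) versus FR (0.57) plays the compensating role described above: the same absolute term frequency contributes more log-odds in the collection with shorter documents.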
No modifications were made to the coefficients in the second-stage regression
equation, Eq. (3), when applying it to the other collections. Because adjustments had
already been made for inter-collection differences in the first-stage equation, the
investigators were not convinced that further adjustments would be profitable at the
second stage. Indeed, we are not entirely confident that the adjustments are a good idea
even for all the variables at the first stage, but thought the experiment worthwhile.
Effectiveness Scores
In the TREC evaluations the 11-point average precision calculated for ad hoc retrieval
by the Berkeley system was 0.151; the average number of relevant documents retrieved
in the top-ranked 100 documents for each request was 40.8; and the average number of
relevant documents retrieved in the top-ranked 200 documents for each request was 67.9.
The comparable three figures for the medians of all systems submitting results for ad hoc
retrieval were 0.157, 39.5, and 62.5 respectively. Comparing the Berkeley system's scores
against the median scores, it can be seen that for the number of relevant documents
retrieved among the top 200, the Berkeley system exceeded the median by 5.4 documents,
an amount that can be shown (in a paired-comparison t-test over the mean of the
differences for the 50 topics) to be statistically significant at the 0.05 level. The other two
scores do not differ from the corresponding median scores by what are customarily
regarded as statistically significant amounts. Thus by one of the three measures the SLR
experimental system was significantly more effective than the median system, and by the
other two it was not significantly different from them.
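The paired-comparison t-test mentioned above tests whether the mean of the per-topic differences is zero. A minimal sketch of the statistic follows; the example score lists are hypothetical, since the actual 50-topic counts are not reproduced here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(system_scores, median_scores):
    """t statistic for a paired-comparison t-test.

    Computes the per-topic differences, then divides their mean by the
    standard error of the mean (sample stdev over sqrt(n)).
    """
    diffs = [a - b for a, b in zip(system_scores, median_scores)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical per-topic relevant-document counts for illustration only:
t = paired_t_statistic([3, 5, 4, 6], [2, 4, 2, 5])
```

With 49 degrees of freedom (50 topics), the resulting statistic would be compared against the two-sided 0.05 critical value of roughly 2.01.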
To put these results in perspective it should be remembered that, unlike the other
systems included in the comparison group, the SLR system was primitive in every respect
except for its statistical logic. It involved no human reformulation of topics or other
manual intervention, it used no relevance feedback, and it employed no special linguistic
or other devices such as parsing systems, thesauri, disambiguation, phrase identification,
or global/local combinations of evidence. It even forwent the use of training data from
four of the five collections. Yet by a statistically careful use of simple frequency
information alone, it was able to hold its own against typical systems that used much
more elaborate forms of evidence.