NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression

W. Cooper, F. Gey, A. Chen

National Institute of Standards and Technology, Donna K. Harman

interest. They are

    For AP:   -7.21 + .40 X1 + .04 X2 +  .88 X3 - .10 X4 + 1.09 X5 + .25 X6
    For DOE:  -7.51 + .44 X1 + .05 X2 + 1.18 X3 - .12 X4 +  .94 X5 + .24 X6
    For FR:   -6.83 + .44 X1 + .04 X2 +  .57 X3 - .06 X4 + 1.13 X5 + .24 X6
    For ZIFF: -6.95 + .38 X1 + .04 X2 +  .68 X3 - .07 X4 + 1.03 X5 + .21 X6

As an illustration of what the transformation of coefficients has accomplished, one sees that the coefficient for X3, the absolute frequency of the match term in the document, is largest for the DOE collection and smallest for the FR collection. This serves to compensate for the fact that the average document length is smallest in DOE and largest in FR.

No modifications were made to the coefficients in the second-stage regression equation, Eq. (3), when applying it to other collections. Because adjustments had already been made for inter-collection differences in the first-stage equation, the investigators were not convinced that further adjustments would be profitable at the second stage. Indeed, we are not entirely confident that the adjustments are a good idea even for all the variables at the first stage, but thought the experiment worth making.

Effectiveness Scores

In the TREC evaluations the 11-point average precision calculated for ad hoc retrieval by the Berkeley system was 0.151; the average number of relevant documents retrieved in the top-ranked 100 documents for each request was 40.8; and the average number of relevant documents retrieved in the top-ranked 200 documents for each request was 67.9. The comparable three figures for the medians of all systems submitting results for ad hoc retrieval were 0.157, 39.5, and 62.5 respectively.
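As a sketch, the four first-stage equations above can be expressed as a lookup of collection-specific coefficients applied to a match term's feature vector. The function name is hypothetical; the six features X1..X6 are simply passed in as numbers (their definitions are given earlier in the paper):

```python
# First-stage coefficients (intercept, b1..b6) per collection,
# transcribed from the four equations in the text.
COEFFS = {
    "AP":   (-7.21, 0.40, 0.04, 0.88, -0.10, 1.09, 0.25),
    "DOE":  (-7.51, 0.44, 0.05, 1.18, -0.12, 0.94, 0.24),
    "FR":   (-6.83, 0.44, 0.04, 0.57, -0.06, 1.13, 0.24),
    "ZIFF": (-6.95, 0.38, 0.04, 0.68, -0.07, 1.03, 0.21),
}

def first_stage_logodds(collection, x):
    """Linear predictor b0 + sum(bi * Xi) for one match term,
    given the feature values x = (X1, ..., X6)."""
    b = COEFFS[collection]
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
```

Note that, as the text observes, the X3 coefficient is largest for DOE (1.18) and smallest for FR (0.57), compensating for their differing average document lengths.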
Comparing the Berkeley system's scores against the median scores, it can be seen that for the number of relevant documents retrieved among the top 200, the Berkeley system exceeded the median by 5.4 documents, an amount that can be shown (in a paired-comparison t-test over the mean of the differences for the 50 topics) to be statistically significant at the 0.05 level. The other two scores do not differ from the corresponding median scores by what are customarily regarded as statistically significant amounts. Thus by one of the three measures the SLR experimental system was significantly more effective than the median system, and by the other two it was not significantly different from them.

To put these results in perspective it should be remembered that, unlike other systems included in the comparison group, the SLR system was primitive in every respect except its statistical logic. It involved no human reformulation of topics or other manual intervention, it used no relevance feedback, and it employed no special linguistic or other devices such as parsing systems, thesauri, disambiguation, phrase identification, or global/local combinations of evidence. It even forwent the use of training data from four of the five collections. Yet by a statistically careful use of simple frequency information alone, it was able to hold its own against typical systems that used much more elaborate forms of evidence.
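The paired-comparison t-test used above is the standard one over per-topic differences; a minimal sketch follows, with made-up per-topic counts (not the actual TREC-1 data):

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-comparison t statistic: the mean of the per-topic
    differences divided by its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative per-topic relevant-retrieved counts (hypothetical).
system_scores = [72, 65, 70, 68, 74]
median_scores = [70, 62, 69, 66, 71]
t = paired_t(system_scores, median_scores)
# With 50 topics (49 degrees of freedom), the two-sided 0.05 critical
# value is about 2.01; |t| above that is significant at the 0.05 level.
```

The same computation is available as scipy.stats.ttest_rel, which also returns a p-value.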