NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Grey, A. Chen
National Institute of Standards and Technology, Donna K. Harman

This estimate of $\log O(R \mid A_1, \ldots, A_N)$ is understood to modify and supersede the estimate produced by Eq. (2). Since logodds are monotonically related to probabilities, a retrieval system could in principle probability-rank the documents of the collection for the user simply by presenting them in descending order of the logodds values assigned to them by Eq. (3). However, since probabilities are easier for users to interpret than logodds, the conditional logodds estimates of form $\log O(R \mid A_1, \ldots, A_N)$ produced by Eq. (3) were translated into ordinary conditional probability estimates with the help of a fourth equation, the identity

$$P(R \mid A_1, \ldots, A_N) = \frac{1}{1 + e^{-\log O(R \mid A_1, \ldots, A_N)}} \qquad (4)$$

The probability estimates so obtained for the TIPSTER data appear in the 'score' column of the rankings submitted by the Berkeley group.

Computational Arrangements

The computations required for storage and retrieval under the SLR methodology follow roughly the order of the four equations.

Documents are indexed as follows. First, an incoming document is put through preparatory operations that include the removal of markup and other unwanted information, the deletion of stop words, and the stemming of the remaining words. Then, for each stem, all stem statistics that can be calculated before the query is known (specifically, $X_3$, $X_4$, $X_5$, and $X_6$) are collected. Finally, these statistics are used to compute the stem's indexing weight in the document using the formula

$$\text{Doc stem weight} = -7.08 + 0.77\,X_3 - 0.07\,X_4 + 1.05\,X_5 + 0.23\,X_6 - (-6.725)$$

This formula is just the right side of Eq.
(1) without the terms for $X_1$ and $X_2$, and with the value of the prior logodds $\log O(R)$ subtracted off in preparation for the application of Eq. (2). The result is stored in the inverted file as the weight of the term for this document.

Query indexing is similar. After an incoming query has been stop-listed and stemmed, for each stem the two statistics $X_1$ and $X_2$ that are dependent upon query properties are calculated. The stem is then assigned as its weight in the query the value

$$\text{Query stem weight} = 0.38\,X_1 + 0.04\,X_2$$

This formula comprises the rest of the right side of Eq. (1).

To effect retrieval, the query is compared against each document with which it shares at least one stem. For each stem contained in both query and document, the query stem weight and the document stem weight are added, resulting in an estimate of $\log O(R \mid A_i) - \log O(R)$ for the stem. Summing these estimates for individual stems over all the match stems yields the value of the summation expression in the right side of Eq. (2). This computation is analogous to the calculation of a vector product in the vector space model, except that corresponding query and document weights are added instead of multiplied.
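The indexing and matching steps described above can be illustrated with a minimal Python sketch. It is not the Berkeley group's implementation. The coefficients and the prior logodds value of -6.725 are taken from the formulas in the text, while the function names, the dictionary-based representation of stem weights, and the sample statistic values are assumptions made only for illustration. The definitions of X1 through X6 and the remaining terms of Eqs. (2) and (3) are not given in this section, so the sketch stops at the summation over match stems and shows the Eq. (4) conversion separately.

```python
import math

# Prior logodds log O(R), as quoted in the document stem weight formula.
PRIOR_LOGODDS = -6.725


def doc_stem_weight(x3, x4, x5, x6):
    """Query-independent part of Eq. (1), with the prior logodds subtracted."""
    return -7.08 + 0.77 * x3 - 0.07 * x4 + 1.05 * x5 + 0.23 * x6 - PRIOR_LOGODDS


def query_stem_weight(x1, x2):
    """Query-dependent part of Eq. (1)."""
    return 0.38 * x1 + 0.04 * x2


def match_sum(query_weights, doc_weights):
    """Sum, over stems shared by query and document, of the ADDED weights.

    Each term estimates log O(R | A_i) - log O(R) for one match stem.
    Note the addition: a vector-space dot product would multiply instead.
    """
    shared = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[s] + doc_weights[s] for s in shared)


def logodds_to_probability(logodds):
    """Eq. (4): P = 1 / (1 + exp(-logodds)).

    The two branches are algebraically identical; the split only avoids
    floating-point overflow for large negative logodds.
    """
    if logodds >= 0:
        return 1.0 / (1.0 + math.exp(-logodds))
    z = math.exp(logodds)
    return z / (1.0 + z)


if __name__ == "__main__":
    # Hypothetical statistic values, chosen only to exercise the functions.
    doc = {"retriev": doc_stem_weight(1.2, 3.0, 0.5, 2.0),
           "probabl": doc_stem_weight(0.8, 3.0, 0.9, 2.0)}
    query = {"retriev": query_stem_weight(1.0, 2.5),
             "logist": query_stem_weight(1.0, 2.5)}
    print("summation term of Eq. (2):", match_sum(query, doc))
    print("Eq. (4) applied to a logodds of -2.0:", logodds_to_probability(-2.0))
```

Because the document stem weights depend only on statistics available at indexing time, they can be computed once and stored in the inverted file, leaving only the query weights and the additions to be done at retrieval time.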