SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Grey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
This estimate of log O(R | A_1, ..., A_N) is understood to modify and supersede the
estimate produced by Eq. (2).
Since logodds are monotonically related to probabilities, a retrieval system could in
principle probability-rank the documents of the collection for the user simply by present-
ing them in descending order of the logodds values assigned to them by Eq. (3). How-
ever, since probabilities are easier for users to interpret than logodds, the conditional
logodds estimates of form log O(R | A_1, ..., A_N) produced by Eq. (3) were translated into
ordinary conditional probability estimates with the help of a fourth equation, the identity
$$P(R \mid A_1, \ldots, A_N) = \frac{1}{1 + e^{-\log O(R \mid A_1, \ldots, A_N)}} \qquad (4)$$
The probability estimates so obtained for the TIPSTER data appear in the `score' column
of the rankings submitted by the Berkeley group.
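
The conversion in Eq. (4) can be illustrated with a minimal Python sketch. The function name, the document identifiers, and the logodds values are hypothetical and chosen only for illustration; the branch on the sign of the logodds is a numerical-stability choice added here, not something the text specifies.

```python
import math


def logodds_to_probability(logodds: float) -> float:
    """Eq. (4): P = 1 / (1 + exp(-logodds)), evaluated without overflow."""
    if logodds >= 0:
        return 1.0 / (1.0 + math.exp(-logodds))
    # For negative logodds use the algebraically equivalent form O / (1 + O).
    odds = math.exp(logodds)
    return odds / (1.0 + odds)


# Hypothetical logodds estimates for three documents.
logodds_by_doc = {"doc_a": -6.725, "doc_b": 0.0, "doc_c": 2.0}

# Ranking by logodds and ranking by probability give the same order,
# because the conversion is monotonically increasing.
ranked = sorted(logodds_by_doc, key=logodds_by_doc.get, reverse=True)
probability_by_doc = {d: logodds_to_probability(logodds_by_doc[d]) for d in ranked}
```

Because the mapping is monotone, the conversion changes only the reported score, never the order of the ranking.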
Computational Arrangements
The computations required for storage and retrieval under the SLR methodology
follow roughly the order of the four equations. Documents are indexed as follows. First,
an incoming document is put through preparatory operations that include the removal of
markup and other unwanted information, the deletion of stop words, and the stemming of
the remaining words. Then for each stem, all stem statistics that can be calculated before
the query is known -- specifically, X3, X4, X5, and X6 -- are collected. Finally, these
statistics are used to compute the stem's indexing weight in the document using the for-
mula
Doc stem weight = -7.08 + .77 X3 - .07 X4 + 1.05 X5 + .23 X6 - (-6.725)
This formula is just the right side of Eq. (1) without the terms for X1 and X2, and with
the value of the prior logodds log 0(R) subtracted off in preparation for the application of
Eq. (2). The result is stored in the inverted file as the weight of the term for this docu-
ment.
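
As a rough sketch of this indexing step in Python, assuming the statistics X3 to X6 have already been computed for each stem: the dictionary-based inverted file, the function names, and the sample statistic values are assumptions made for illustration, and the preparatory operations (markup removal, stop-word deletion, stemming) are omitted.

```python
PRIOR_LOGODDS = -6.725  # log O(R), the value subtracted in the formula above


def doc_stem_weight(x3: float, x4: float, x5: float, x6: float) -> float:
    """Document-side part of Eq. (1) with the prior logodds subtracted off."""
    return -7.08 + 0.77 * x3 - 0.07 * x4 + 1.05 * x5 + 0.23 * x6 - PRIOR_LOGODDS


def index_document(inverted_file: dict, doc_id: str, stem_stats: dict) -> None:
    """Store one weight per (stem, document) pair.

    stem_stats maps each stem to its (X3, X4, X5, X6) tuple; how those
    statistics are derived is outside the scope of this sketch.
    """
    for stem, (x3, x4, x5, x6) in stem_stats.items():
        inverted_file.setdefault(stem, {})[doc_id] = doc_stem_weight(x3, x4, x5, x6)


# Hypothetical statistics for two stems of one document.
inverted_file = {}
index_document(
    inverted_file,
    "doc1",
    {"retriev": (1.0, 2.0, 0.5, 1.5), "logist": (0.0, 0.0, 0.0, 0.0)},
)
```

Since every quantity in the weight is known at indexing time, no document statistics need to be consulted again when a query arrives.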
Query indexing is similar. After an incoming query has been stop-listed and
stemmed, for each stem the two statistics X1 and X2 that are dependent upon query prop-
erties are calculated. The stem is then assigned as its weight in the query the value
Query stem weight = .38 X1 + .04 X2
This formula comprises the rest of the right side of Eq. (1).
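
The split of Eq. (1) between query and document can be checked numerically. In the Python sketch below, `eq1_right_side` is reconstructed from the description in the text (the document formula plus the X1 and X2 terms, with the prior added back) rather than copied from Eq. (1) itself, and the statistic values are hypothetical.

```python
PRIOR_LOGODDS = -6.725  # log O(R), as used in the document stem weight


def doc_stem_weight(x3: float, x4: float, x5: float, x6: float) -> float:
    """Document-side terms of Eq. (1), prior logodds subtracted."""
    return -7.08 + 0.77 * x3 - 0.07 * x4 + 1.05 * x5 + 0.23 * x6 - PRIOR_LOGODDS


def query_stem_weight(x1: float, x2: float) -> float:
    """Query-side terms of Eq. (1)."""
    return 0.38 * x1 + 0.04 * x2


def eq1_right_side(x1, x2, x3, x4, x5, x6) -> float:
    """Right side of Eq. (1) as implied by the two weight formulas (an inference)."""
    return (-7.08 + 0.38 * x1 + 0.04 * x2
            + 0.77 * x3 - 0.07 * x4 + 1.05 * x5 + 0.23 * x6)


# Hypothetical values of X1..X6 for one stem shared by a query and a document.
x = (1.0, 2.0, 0.5, 3.0, 0.25, 1.5)
combined = query_stem_weight(*x[:2]) + doc_stem_weight(*x[2:])
difference = combined - (eq1_right_side(*x) - PRIOR_LOGODDS)  # zero up to rounding
```

Under these assumptions the sum of the two stored weights equals the Eq. (1) estimate minus the prior logodds, which is the per-stem quantity that Eq. (2) sums.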
To effect retrieval, the query is compared against each document with which it
shares at least one stem. For each stem contained in both query and document, the query
stem weight and the document stem weight are added, resulting in an estimate of
log O(R | A_i) - log O(R) for the stem. Summing these estimates for individual stems
over all the match stems yields the value of the summation expression in the right side of
Eq. (2). This computation is analogous to the calculation of a vector product in the vector
space model, except that corresponding query and document weights are added instead of
multiplied.
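
The matching step might be sketched in Python as follows. The dictionary layout of the inverted file, the function name `match_sums`, and all weights are hypothetical; the sketch stops at the summation expression and does not apply the prior term of Eq. (2) or the adjustment of Eq. (3), neither of which is reproduced in this section.

```python
from collections import defaultdict


def match_sums(query_weights: dict, inverted_file: dict) -> dict:
    """Accumulate, per document, the sum over matching stems of
    (query stem weight + document stem weight)."""
    sums = defaultdict(float)
    for stem, q_weight in query_weights.items():
        # Only documents that contain the stem appear in its postings,
        # so documents sharing no stem with the query are never touched.
        for doc_id, d_weight in inverted_file.get(stem, {}).items():
            sums[doc_id] += q_weight + d_weight  # added, not multiplied
    return dict(sums)


# Hypothetical inverted file: stem -> {document id -> document stem weight}.
inverted_file = {
    "retriev": {"d1": 1.2, "d2": 0.4},
    "logist": {"d1": 0.9},
}
# Hypothetical query stem weights; "regress" occurs in no document.
query_weights = {"retriev": 0.5, "logist": 0.3, "regress": 0.7}

sums = match_sums(query_weights, inverted_file)
```

The loop has the same shape as an inner-product accumulation in the vector space model; the only change is the `+` in place of `*` inside the accumulator.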