SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
which to regress. If one could trust the Linked Dependence Assumption completely, and
if nonmatch as well as match events had been taken into explicit account in the computation
of Z, one might have tried letting Z stand as-is as the desired logodds estimate. But
because this is not the case, letting Z stand would fail to correct for dependency distortion
and would give longer documents an unfair advantage. Thus it seemed advisable to perform
a corrective linear transformation on Z, and moreover to normalize Z first by dividing it
by some simple function of document length. It was found by trial and error that dividing
Z by L raised to a power of around 0.4 seemed to remove most of the visible bias toward
either very short documents or very long documents in the five collections. Taking the
logarithm of the entire expression was found to improve the fit to the sample data.
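
The length normalization just described can be illustrated with a minimal sketch. It assumes that Z is positive (so that the logarithm is defined) and that L is the document length; the function names and the intercept and slope values are hypothetical placeholders, not the fitted coefficients of Eq. (1).

```python
import math


def length_normalized_predictor(z: float, doc_length: float, power: float = 0.4) -> float:
    """Return log(Z / L**power), the length-normalized predictor.

    Assumes z > 0 and doc_length > 0; the default power of 0.4 is the
    approximate value reported as found by trial and error.
    """
    if z <= 0 or doc_length <= 0:
        raise ValueError("z and doc_length must both be positive")
    return math.log(z / doc_length ** power)


def logodds_estimate(z: float, doc_length: float, intercept: float, slope: float) -> float:
    """Apply a corrective linear transformation to the normalized predictor.

    The intercept and slope are hypothetical; in practice they would come
    from a regression on sample data.
    """
    return intercept + slope * length_normalized_predictor(z, doc_length)


# With the same Z, a longer document receives a smaller predictor value.
short_doc = length_normalized_predictor(5.0, 100)
long_doc = length_normalized_predictor(5.0, 1000)
```

Dividing by a power of L below one penalizes long documents less severely than dividing by L itself would.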
It would have been appropriate to include a correction for query length analogous
to the one developed for document length. However, for lack of time the necessary anal-
ysis could not be carried out.
Extrapolation to Other Collections
The regression analysis was confined to the WSJ data because relevance judge-
ments were not available for most of the other collections in sufficient quantities, or for
enough of the training queries, to make regression feasible. This circumstance brought
with it the problem of how to extrapolate the WSJ retrieval rules to the remaining four
collections.
Speaking generally, the extension of retrieval formulae to other collections is a sig-
nificant problem throughout the IR field. One would like to know how to transfer design
parameters from one collection, for which there is enough relevance data, to another col-
lection for which there may be too little data or none. If the transfer could be accom-
plished without too much loss of predictive power, the almost exclusive use by IR experi-
menters of special `test collections' could be justified more easily. We welcomed the
dearth of TIPSTER relevance data for some of the collections as an opportunity to
explore this problem. To simplify the challenge and confront it in its starkest form, we
elected to ignore entirely even such data as were available for collections other than WSJ.
The method used for the extrapolation was based on the well known statistical con-
cept of standardization of variables. The standardized value of a variable in a population
is obtained by subtracting from its observed value the variable's mean value in the popu-
lation, then dividing this difference by its standard deviation in the population. The new
standardized values have a mean of zero and a standard deviation of one. The working
assumption was that a regression equation such as Eq. (1) can be carried
over and applied in another collection provided all variables involved have first been
standardized in both collections.
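
As a small illustration of standardization as defined above, the following sketch subtracts the population mean from each value and divides by the population standard deviation. The sample values are made up for the example and are not taken from any of the collections.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the standardized values (mean 0, standard deviation 1)."""
    mean = fmean(values)
    sd = pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


# Illustrative values only: mean 5, population standard deviation 2.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(sample)
```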
Although no variable values were actually standardized, the coefficients in Eq. (1)
were recalculated for each of the other four collections in such a way as to create the
same effect. The values for the population means and standard deviations used in the
recalculation of the coefficients were taken from random samples of triples drawn from
the five collections. The samples were comparable to those in the `random' subsample
of WSJ query-document triples described earlier.
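
One way such a recalculation could work is sketched below for a single predictor. This is an assumption made for illustration, not the authors' actual algebra: it supposes a linear equation of the form intercept + slope * x fitted on a source collection, and rewrites the coefficients so that applying them to raw values in a target collection gives the same result as standardizing in both collections. All names and numbers are hypothetical.

```python
def transfer_coefficients(intercept_src, slope_src, mean_src, sd_src, mean_tgt, sd_tgt):
    """Recalculate (intercept, slope) for a target collection.

    A target value x_t is treated as equivalent to the source value with
    the same standardized score:
        x_s = mean_src + sd_src * (x_t - mean_tgt) / sd_tgt
    Substituting x_s into intercept_src + slope_src * x_s and collecting
    terms in x_t gives the coefficients returned here.
    """
    slope_tgt = slope_src * sd_src / sd_tgt
    intercept_tgt = intercept_src + slope_src * mean_src - slope_tgt * mean_tgt
    return intercept_tgt, slope_tgt


# Hypothetical numbers: source mean 1.0, sd 0.5; target mean 4.0, sd 2.0.
intercept_tgt, slope_tgt = transfer_coefficients(-3.0, 2.0, 1.0, 0.5, 4.0, 2.0)

# A target value one standard deviation above its mean (x_t = 6.0) should
# score the same as a source value one standard deviation above its mean
# (x_s = 1.5).
score_tgt = intercept_tgt + slope_tgt * 6.0
score_src = -3.0 + 2.0 * 1.5
```

Because the adjustment is folded into the coefficients, no individual variable value needs to be standardized at retrieval time.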
The algebraic details of the transformation process will not be presented here, but
the resulting modifications of the right side of the earlier WSJ form of Eq. (1) may be of