NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman (ed.), National Institute of Standards and Technology

Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr and C. Buckley
ranks. This effect was due to the positive coefficient of lognumterms, which gave indexing weights
larger than 1 for the terms in these documents, thus yielding high retrieval status values for these
documents w.r.t. most queries.
Obviously a linear indexing function is not well suited for coping with such extreme distributions of
the elements of the relevance description. Besides the effect on the few very long documents, other
documents are also affected by this distribution, for the following reason: many terms in the long
documents will be assigned indexing weights larger than 1. For the computation of the squared error
that is to be minimized, the quadratic difference (y - u(x))^2 between the indexing weight u(x) and
the actual relevance decision y (0 or 1) in the learning sample is considered. Even if y = 1, there is
still the difference (u(x) - 1)^2 for u(x) > 1. Reducing this error, and thus the overall error, forces
the minimization to accept larger errors for the indexing weights of other documents. There are three
possible strategies for coping with this problem:
* Some experiments performed with other material have shown that overall indexing quality can
be improved by excluding outliers from the learning sample ([Pfeifer 91]). Not only the pairs with
weights lying on the "correct side" of the interval [0,1] (i.e. either y = 0 and u(x) < 0 or y = 1 and
u(x) > 1) should be regarded as outliers; excluding the pairs with weights lying on the "wrong side"
(i.e. either y = 0 and u(x) > 1 or y = 1 and u(x) < 0) also yields better results.
* Using logistic functions instead of linear functions overcomes the problem of indexing weights
outside the interval [0,1]. However, we have some experimental evidence ([Pfeifer 90]) that even
in this case the removal of outliers from the learning sample (where the outliers are defined w.r.t.
a linear indexing function) improves the indexing quality.
* For the experiments described here, we switched from linear to polynomial functions, which also
gave fairly good results.
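The first strategy can be sketched as follows — a minimal illustration with synthetic data (the real learning sample and the construction of the relevance-description elements are not reproduced here), assuming an ordinary least-squares fit of a linear indexing function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic learning sample: each row holds the elements of the relevance
# description for one (term, document) pair; y is the relevance decision.
X = rng.random((200, 4))
y = (X[:, 0] + 0.2 * rng.standard_normal(200) > 0.5).astype(float)

def fit(X, y):
    """Coefficients (intercept first) minimizing the sum of (y - u(x))^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# First pass: fit on the full sample, then drop every pair whose weight
# falls outside [0, 1] -- both the "correct side" outliers (y = 0, u < 0
# or y = 1, u > 1) and the "wrong side" ones (y = 0, u > 1 or y = 1,
# u < 0) -- and refit on the remainder.
coef = fit(X, y)
u = predict(coef, X)
keep = (u >= 0.0) & (u <= 1.0)
coef_refit = fit(X[keep], y[keep])
```

The two exclusion criteria together remove exactly the pairs whose weight lies outside [0,1], which is why a single interval test suffices in the sketch.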
The function finally used for indexing with single terms only is

u(x) =   0.00042293
       + 0.00150083 · tf · logidf · imaxtf
       - 0.00150665 · tf · imaxtf
       + 0.00010465 · logidf
       - 0.00122627 · lognumterms · imaxtf.
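For illustration (this is our sketch, not code from the paper), the polynomial above can be evaluated directly once the four elements of the relevance description are available; how tf, imaxtf, logidf, and lognumterms are derived from the raw document statistics is assumed to happen elsewhere.

```python
def u_single(tf, imaxtf, logidf, lognumterms):
    """Indexing weight u(x) for a single term, using the polynomial
    function and coefficients given above. The four arguments are the
    precomputed elements of the relevance description x."""
    return (0.00042293
            + 0.00150083 * tf * logidf * imaxtf
            - 0.00150665 * tf * imaxtf
            + 0.00010465 * logidf
            - 0.00122627 * lognumterms * imaxtf)
```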
For indexing with phrases, we first had to derive the set of phrases to be considered. We took all
pairs of adjacent non-stopwords that occurred at least 25 times in the D1 (training) document set.
Then an indexing function common to single words and phrases was developed by introducing additional
binary factors is_single (is_phrase) having the value 1 if the term is a single word (a phrase) and 0
otherwise. The parameters tf and logidf were defined for phrases in the same way as for single words,
and in the computation of imaxtf and lognumterms for a document both phrases and single words
were considered. This procedure produced the indexing function
u(x) =   0.00034104
       + 0.00141097 · is_single · tf · logidf · imaxtf
       - 0.00119826 · is_single · tf · imaxtf
       + 0.00014122 · is_single · logidf
       - 0.00120413 · lognumterms · imaxtf
       + 0.00820515 · is_phrase · tf · logidf · imaxtf
       - 0.04188466 · is_phrase · tf · imaxtf
       + 0.00114585 · is_phrase · logidf.
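To make the two steps concrete, the following sketch (our illustration, not the authors' code) derives candidate phrases as pairs of adjacent non-stopwords — one plausible reading of the procedure above — and evaluates the combined polynomial; the stoplist is a hypothetical stand-in.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "in", "to"}  # illustrative stoplist

def candidate_phrases(documents, min_freq=25):
    """All pairs of adjacent non-stopwords occurring at least min_freq
    times in the document set (read here as: adjacent in the running
    text, with neither word a stopword)."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1 not in STOPWORDS and w2 not in STOPWORDS:
                counts[(w1, w2)] += 1
    return {pair for pair, n in counts.items() if n >= min_freq}

def u_combined(is_single, tf, imaxtf, logidf, lognumterms):
    """Combined indexing weight for single words and phrases, using the
    coefficients given above; is_single is 1 for a single word and 0 for
    a phrase, with is_phrase as its complement."""
    is_phrase = 1 - is_single
    return (0.00034104
            + 0.00141097 * is_single * tf * logidf * imaxtf
            - 0.00119826 * is_single * tf * imaxtf
            + 0.00014122 * is_single * logidf
            - 0.00120413 * lognumterms * imaxtf
            + 0.00820515 * is_phrase * tf * logidf * imaxtf
            - 0.04188466 * is_phrase * tf * imaxtf
            + 0.00114585 * is_phrase * logidf)
```

Note that for is_single = 1 all phrase terms vanish, so the combined function reduces to a single-word polynomial of the same form as the one given earlier, only with coefficients refitted on the enlarged term set.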