NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
tf = within-document frequency (wdf) of t_i in d_m,
logidf = log(inverse document frequency),
lognumterms = log(number of different terms in d_m),
imaxtf = 1/(maximum wdf of a term in d_m).
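As a sketch, the relevance description can be computed from simple collection statistics. Variable and function names here are illustrative, and the exact idf formula is not given in this excerpt, so a common definition is assumed:

```python
import math

def relevance_description(tf, df, num_docs, num_terms, max_tf):
    """Build the relevance description x(t_i, d_m) from simple counts.

    tf        -- within-document frequency (wdf) of the term in d_m
    df        -- number of documents containing the term
    num_docs  -- total number of documents in the collection
    num_terms -- number of different terms in d_m
    max_tf    -- maximum wdf of any term in d_m
    """
    # log(inverse document frequency); log(N/df) is one common definition,
    # assumed here since the excerpt does not spell out the formula
    logidf = math.log(num_docs / df)
    lognumterms = math.log(num_terms)   # log(# different terms in d_m)
    imaxtf = 1.0 / max_tf               # reciprocal of the maximum wdf
    return (tf, logidf, lognumterms, imaxtf)
```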
In the experiments described in the following, we only use these simple parameters. However, it should
be emphasized that the concept of relevance description is a means for coping with rather complex
forms of document representations, e.g. when advanced natural language processing techniques are
applied.
In the decision step, probabilistic indexing weights of the form P(R|x(t_i, d_m)) are estimated. Instead of
the probability P(R|t_i, d_m), which relates to a specific document and a specific term, P(R|x(t_i, d_m)) is
the probability that an arbitrary term-document pair having relevance description x will be involved
in a relevant query-document relationship. This probability is estimated by a so-called indexing
function u(x). Different regression methods or probabilistic classification algorithms can serve as
indexing function. Here, we use polynomial regression. For this purpose, a polynomial structure
v(x) = (v_1, ..., v_L) has to be defined as
v(x) = (1, x_1, x_2, ..., x_N, x_1^2, x_1·x_2, ...),
where N is the number of dimensions of x. The class of polynomials is given by the components
x_i^l · x_j^m · x_k^n · ... (i, j, k, ... ∈ [1, N]; l, m, n, ... > 0) which are to be included in the polynomial.
The indexing function now yields u(x) = a^T · v(x), where a = (a_1, ..., a_L)^T is the coefficient vector
which has to be estimated in a separate training phase preceding the application of the indexing
function. So P(R|x) is approximated by the polynomial
u(x) = a_1 + a_2·x_1 + a_3·x_2 + ... + a_{N+1}·x_N + a_{N+2}·x_1^2 + a_{N+3}·x_1·x_2 + ...
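The polynomial structure and the resulting indexing function can be sketched in Python. This minimal sketch covers only the linear case v(x) = (1, x_1, ..., x_N); function names are ours, not from the paper:

```python
import numpy as np

def polynomial_structure(x):
    """v(x) = (1, x_1, ..., x_N): the linear special case.
    Higher-order components such as x_1^2 or x_1*x_2 could be appended."""
    return np.concatenate(([1.0], x))

def indexing_function(a, x):
    """u(x) = a^T · v(x), the estimate of P(R | x(t_i, d_m))."""
    return float(np.dot(a, polynomial_structure(x)))
```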
In the training phase, a learning sample is derived from relevance feedback data in the following way:
for every query-document pair (q_k, d_m), a learning sample element (x(t_i, d_m), y) is generated for each
term t_i common to q_k and d_m, with y = 1 if d_m is relevant w.r.t. q_k, and y = 0 otherwise. Given this
set of learning sample elements, a linear regression method is applied in order to minimize the squared
error (y - u(x))^2 (see [Fuhr & Buckley 91] for the details). The result of the regression method is the
coefficient vector a which defines the indexing function.
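The least-squares step above can be sketched with NumPy's `lstsq` as a stand-in for the regression method of [Fuhr & Buckley 91] (the linear case; names are illustrative):

```python
import numpy as np

def train_coefficients(X, y):
    """Estimate the coefficient vector a minimising sum (y - u(x))^2.

    X -- matrix whose rows are relevance descriptions x(t_i, d_m)
    y -- 0/1 relevance labels derived from the feedback data
    """
    # Rows of V are the polynomial structures v(x) = (1, x_1, ..., x_N)
    V = np.hstack([np.ones((X.shape[0], 1)), X])
    a, *_ = np.linalg.lstsq(V, y, rcond=None)
    return a
```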
In the application phase, for each term occurring in a document, a relevance description is constructed
first and then the indexing function gives the indexing weight of the term in the document.
2.2 Experiments
We developed two types of document indexing, one with single terms only and one with both single
words and phrases.
For the single words, we initially used a linear indexing function, which yielded:
u(x(t_i, dm)) = -0.0019844 +
0.0000296 · tf +
0.0001079 · imaxtf +
0.0003519 · logidf +
0.0003068 · lognumterms.
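With these coefficients, computing a single-word indexing weight is one dot product. A sketch (the coefficients are from the text; the function name is ours, and the parameters are computed as described in Section 2.1):

```python
def single_word_weight(tf, imaxtf, logidf, lognumterms):
    """Linear indexing function for single words (coefficients from the text)."""
    return (-0.0019844
            + 0.0000296 * tf
            + 0.0001079 * imaxtf
            + 0.0003519 * logidf
            + 0.0003068 * lognumterms)
```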
Retrieval experiments with this type of document indexing, however, revealed that the few extremely
long documents (Federal Register documents with up to 2.6 MB) were always retrieved in the top