NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr, C. Buckley
National Institute of Standards and Technology, Donna K. Harman (ed.)

tf = within-document frequency (wdf) of t_i in d_m,
logidf = log(inverse document frequency),
lognumterms = log(number of different terms in d_m),
imaxtf = 1/(maximum wdf of a term in d_m).

In the experiments described in the following, we only use these simple parameters. However, it should be emphasized that the concept of relevance description is a means for coping with rather complex forms of document representations, e.g. when advanced natural language processing techniques are applied.

In the decision step, probabilistic indexing weights of the form P(R | x(t_i, d_m)) are estimated. Instead of the probability P(R | t_i, d_m), which relates to a specific document and a specific term, P(R | x(t_i, d_m)) is the probability that an arbitrary term-document pair having relevance description x will be involved in a relevant query-document relationship. This probability is estimated by a so-called indexing function u(x). Different regression methods or probabilistic classification algorithms can serve as indexing function. Here, we use polynomial regression. For this purpose, a polynomial structure v(x) = (v_1, ..., v_L) has to be defined as

v(x) = (1, x_1, x_2, ..., x_N, x_1^2, x_1·x_2, ...),

where N is the number of dimensions of x. The class of polynomials is given by the components x_i^l · x_j^m · x_k^n · ... (i, j, k, ... ∈ [1, N]; l, m, n, ... ≥ 0) which are to be included in the polynomial. The indexing function now yields

u(x) = a^T · v(x),

where a = (a_1, ..., a_L)^T is the coefficient vector which has to be estimated in a separate training phase preceding the application of the indexing function. So P(R | x) is approximated by the polynomial

u(x) = a_1 + a_2·x_1 + a_3·x_2 + ... + a_{N+1}·x_N + a_{N+2}·x_1^2 + a_{N+3}·x_1·x_2 + ...
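The polynomial structure and the resulting indexing function can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it fixes the polynomial to the constant, linear, and quadratic components, and the coefficient vector would come from the training phase described below.

```python
import numpy as np

def poly_structure(x):
    """Polynomial structure v(x) for an N-dimensional relevance
    description x: the constant 1, the linear components x_1..x_N,
    and (in this sketch) all quadratic components x_i * x_j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    quadratic = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], x, quadratic))

def indexing_weight(x, a):
    """Indexing function u(x) = a^T * v(x); a is the coefficient
    vector estimated in a separate training phase."""
    return float(np.dot(np.asarray(a, dtype=float), poly_structure(x)))
```

For N = 2 the structure has L = 6 components (1, x_1, x_2, x_1^2, x_1·x_2, x_2^2), matching the expansion u(x) = a_1 + a_2·x_1 + a_3·x_2 + ... shown above.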
In the training phase, a learning sample is derived from relevance feedback data in the following way: for every query-document pair (q_k, d_m), a learning sample element (x(t_i, d_m), y) is generated for each term t_i common to q_k and d_m, with y = 1 if d_m is relevant w.r.t. q_k, and y = 0 otherwise. Given this set of learning sample elements, a linear regression method is applied in order to minimize the squared error (y - u(x))^2 (see [Fuhr & Buckley 91] for the details). The result of the regression method is the coefficient vector a, which defines the indexing function. In the application phase, for each term occurring in a document, a relevance description is constructed first, and then the indexing function gives the indexing weight of the term in the document.

2.2 Experiments

We developed two types of document indexing, one with single terms only and one with both single words and phrases. For the single words, we initially used a linear indexing function, which gave us the function:

u(x(t_i, d_m)) = -0.0019844 + 0.0000296·tf + 0.0001079·imaxtf + 0.0003519·logidf + 0.0003068·lognumterms.

Retrieval experiments with this type of document indexing, however, revealed that the few extremely long documents (Federal Register documents with up to 2.6 MB) were always retrieved in the top
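The training and application phases for a linear indexing function can be sketched with ordinary least squares. This is an illustrative assumption, not the paper's setup: the feature values, relevance labels, and the new relevance description below are all invented, and NumPy's least-squares solver stands in for whatever regression procedure [Fuhr & Buckley 91] actually used.

```python
import numpy as np

# Each row is a relevance description x(t_i, d_m) with the four
# features used above: (tf, imaxtf, logidf, lognumterms).
# All values and relevance labels y are invented for illustration.
X = np.array([
    [3.0, 0.20, 2.1, 4.6],
    [1.0, 0.50, 0.9, 3.2],
    [7.0, 0.10, 3.0, 5.1],
    [2.0, 0.25, 1.2, 4.0],
])
y = np.array([1.0, 0.0, 1.0, 0.0])  # y = 1 iff d_m relevant w.r.t. q_k

# Linear polynomial structure: v(x) = (1, x_1, ..., x_N).
V = np.hstack([np.ones((X.shape[0], 1)), X])

# Training phase: least-squares fit minimizing sum over the
# learning sample of (y - a^T v(x))^2.
a, *_ = np.linalg.lstsq(V, y, rcond=None)

# Application phase: indexing weight u(x) = a^T v(x) for a new
# (hypothetical) term-document pair.
x_new = np.array([4.0, 0.15, 2.5, 4.8])
u = float(np.dot(a, np.concatenate(([1.0], x_new))))
```

With real feedback data the learning sample would contain one row per term common to a query and a judged document, and the fitted coefficient vector a plays the role of the five constants in the linear indexing function above.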