NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr, C. Buckley
National Institute of Standards and Technology, Donna K. Harman

seen in the learning sample. If the new documents contained totally different terms, the learning sample would be of no use for coping with these documents.

For the routing queries, we mostly apply search term weighting methods, which are well established in IR. For the ad-hoc queries, the task is more difficult: given relevance information for some query-document pairs, this information has to be exploited in order to rank new documents w.r.t. new queries; furthermore, most of these new query-document pairs will involve new terms.

As a method for dealing with this type of task, description-oriented indexing has been developed ([Fuhr & Buckley 91]). The major concept of this approach is abstraction. For the routing queries, we abstract from specific documents by regarding the presence or absence of terms. In description-oriented indexing, we have to abstract in addition from specific queries and terms. This can be done by regarding features of these objects instead of the objects themselves. Similar to pattern recognition methods, documents, queries and terms are described here by sets of features. This way, new documents, queries and terms can be mapped onto sets of features which we have already seen in the learning sample.

In principle, description-oriented learning could be combined with most IR models. Only the probabilistic model, however, relates directly to retrieval quality. The probability ranking principle described in [Robertson 77] states that optimum retrieval performance is achieved when documents are ranked according to descending values of their probability of relevance.
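The probability ranking principle stated above can be illustrated with a minimal sketch. The function name `rank_by_probability`, the document identifiers and the probability values are assumptions made for illustration only; they are not taken from the paper, and the sketch says nothing about how the probabilities are estimated.

```python
from typing import Dict, List, Tuple


def rank_by_probability(prob_relevance: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order documents by descending estimated probability of relevance.

    `prob_relevance` maps a (hypothetical) document id to an estimate of
    the probability that the document is relevant to the current query.
    """
    # Sort by probability, highest first; ties are broken by document id
    # so that the ordering is deterministic.
    return sorted(prob_relevance.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative estimates only, not experimental data.
estimates = {"d1": 0.20, "d2": 0.75, "d3": 0.40}
ranking = rank_by_probability(estimates)
print([doc_id for doc_id, _ in ranking])  # prints ['d2', 'd3', 'd1']
```

Any retrieval function that yields such probability estimates, or a monotone transformation of them, produces the same ranking.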
So our probabilistic approach uses the learning data in order to optimize retrieval quality for the test sample. This statement also holds for our work with the routing queries, where we apply the retrieval-with-probabilistic-indexing (RPI) model for estimating the query term weights.

In the following section, we describe the description-oriented document indexing method, which yields probabilistic weights for terms w.r.t. documents. Two methods are applied here in order to use these weights in retrieval. When no relevance information for the specific query is available, a utility-theoretic retrieval function can be used, where the utility weights for the query terms are derived by heuristics. With relevance feedback data, however, the RPI model can be applied. Experiments with the ad-hoc queries are described in section 3.1, followed by the presentation of our work with the routing queries in section 3.2.

2 Probabilistic document indexing

2.1 General approach

We first give a brief outline of the description-oriented indexing approach, which is presented in full detail in [Fuhr & Buckley 91]. Based on the binary independence indexing model ([Maron & Kuhns 60], [Fuhr 89]), one can define probabilistic document indexing weights as follows: let $d_m$ denote a document, $t_i$ a term and $R$ the fact that a query-document pair is judged relevant; then $P(R \mid t_i, d_m)$ denotes the probability that document $d_m$ will be judged relevant w.r.t. an arbitrary query that contains term $t_i$. These weights can hardly be estimated directly, since there will not be enough relevance information available for a specific document. Instead, the description-oriented indexing approach is applied, where the indexing task is divided into two steps, namely a description step and a decision step. In the description step, term-document pairs $(t_i, d_m)$ are mapped onto so-called relevance descriptions $x(t_i, d_m)$.
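The description step can be sketched as a function that maps a term-document pair onto a feature vector. This is a minimal sketch under stated assumptions: the three features used here (within-document frequency, relative frequency, and a smoothed inverse document frequency) and all names such as `Document` and `relevance_description` are hypothetical choices for illustration, not the feature set defined by the authors.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical document representation: id plus raw term counts."""
    doc_id: str
    term_counts: Dict[str, int]


def relevance_description(term: str, doc: Document,
                          doc_freq: Dict[str, int], num_docs: int) -> List[float]:
    """Map a term-document pair onto a feature vector.

    The features are illustrative assumptions, not the paper's definition:
      [0] within-document frequency of the term
      [1] that frequency relative to the document length
      [2] smoothed log inverse document frequency of the term
    """
    tf = doc.term_counts.get(term, 0)
    length = sum(doc.term_counts.values())
    df = doc_freq.get(term, 0)
    return [
        float(tf),
        tf / length if length else 0.0,
        math.log((num_docs + 1) / (df + 1)),
    ]


doc = Document("d1", {"retrieval": 3, "index": 1})
x = relevance_description("retrieval", doc, {"retrieval": 10}, num_docs=100)
print(x[:2])  # prints [3.0, 0.75]
```

Because the vector depends only on features and not on the identity of the term or document, a pair involving a previously unseen term is still mapped into the same feature space as the pairs in the learning sample.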
The elements $x_i$ of the relevance description contain values of features of $t_i$, $d_m$ and their relationship, e.g.