SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
chapter
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
seen in the learning sample. If the new documents would contain totally different terms, the learning
sample would be of no use for coping with these documents. For dealing with the routing queries,
mostly search term weighting methods are applied, which are well established in IR.
For the ad-hoc queries, the task is more difficult: Given relevance information for some query-document
pairs, this information has to be exploited in order to rank new documents w.r.t. new queries; further-
more, most of these new query-document pairs will involve new terms. As a method for dealing with
this type of task, description-oriented indexing has been developed ([Fuhr & Buckley 91]). The major
concept of this approach is abstraction. For the routing queries, we abstract from specific documents
by regarding the presence or absence of terms. In description-oriented indexing, we have to abstract
in addition from specific queries and terms. This can be done by regarding features of these objects
instead of the objects themselves. Similar to pattern recognition methods, documents, queries and terms
are described by sets of features here. This way, new documents, queries and terms can be mapped
onto sets of features which we have already seen in the learning sample.
In principle, description-oriented learning could be combined with most IR models. Only the proba-
bilistic model, however, relates directly to retrieval quality. The probability ranking principle described
in [Robertson 77] states that optimum retrieval performance is achieved when documents are ranked
according to descending values of their probability of relevance. So our probabilistic approach uses
the learning data in order to optimize retrieval quality for the test sample. This statement also holds
for our work with the routing queries, where we apply the retrieval-with-probabilistic-indexing (RPI)
model for estimating the query term weights.
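The ranking rule implied by the probability ranking principle can be illustrated with a minimal sketch. It assumes that an estimate of the probability of relevance is already available for each document; the function name, the document identifiers and the numeric values are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple


def rank_by_probability(prob_relevant: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order documents by descending estimated probability of relevance."""
    # Ties are broken by document id so that the ordering is deterministic.
    return sorted(prob_relevant.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical estimates of the probability of relevance for one query.
estimates = {"d1": 0.20, "d2": 0.75, "d3": 0.40}
ranking = rank_by_probability(estimates)
```

Here the quality of the ranking depends entirely on the quality of the probability estimates, which is why the learning data are used to estimate them.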
In the following section, we describe the description-oriented document indexing method which yields
probabilistic weights for terms w.r.t. documents. In order to use these weights in retrieval, two methods
are applied here. When no relevance information for the specific query is available, a utility-theoretic
retrieval function can be used, where the utility weights for the terms from the query are derived by
some heuristics. With relevance feedback data, however, the RPI model can be applied. Experiments
with the ad-hoc queries are described in section 3.1, followed by the presentation of our work with
the routing queries in section 3.2.
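As a rough illustration of how probabilistic indexing weights and query term utilities might be combined, the following sketch assumes a linear retrieval function, i.e. a sum over the query terms of utility weight times indexing weight. The linear form, the function name and all values are assumptions made for this example; the text above does not fix them.

```python
from typing import Dict


def linear_score(utility: Dict[str, float], indexing_weight: Dict[str, float]) -> float:
    """Sum, over the query terms, of utility weight times indexing weight."""
    # Query terms that do not occur in the document contribute nothing.
    return sum(u * indexing_weight.get(term, 0.0) for term, u in utility.items())


# Hypothetical utility weights for a query and indexing weights for a document.
query_utilities = {"retrieval": 1.0, "probabilistic": 2.0}
doc_weights = {"retrieval": 0.5, "indexing": 0.25}
score = linear_score(query_utilities, doc_weights)
```

Only the term "retrieval" is shared by the query and the document in this example, so it alone contributes to the score.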
2 Probabilistic document indexing
2.1 General approach
We first give a brief outline of the description-oriented indexing approach, which is presented in full de-
tail in [Fuhr & Buckley 91]. Based on the binary independence indexing model ([Maron & Kuhns 60],
[Fuhr 89]), one can define probabilistic document indexing weights as follows: let d_m denote a
document, t_i a term and R the fact that a query-document pair is judged relevant; then P(R | t_i, d_m)
denotes the probability that document d_m will be judged relevant w.r.t. an arbitrary query that
contains term t_i. These weights can hardly be estimated directly, since there will not be enough relevance
information available for a specific document. Instead, the description-oriented indexing approach is
applied, where the indexing task is divided into two steps, namely a description step and a decision
step.
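The division into a description step and a decision step can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimation procedure of [Fuhr & Buckley 91]: the two features (capped within-document term frequency, occurrence in the title) are hypothetical, and the decision step is replaced by a simple relative-frequency estimate per description.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical relevance description: (capped term frequency, term occurs in title).
Description = Tuple[int, bool]


def describe(term: str, doc: Dict[str, List[str]]) -> Description:
    """Description step: map a term-document pair onto a feature tuple."""
    tf = doc["body"].count(term)
    return (min(tf, 3), term in doc["title"])


def fit(sample: List[Tuple[Description, bool]]) -> Dict[Description, float]:
    """Decision step (stand-in): relative frequency of relevance per description."""
    counts = defaultdict(lambda: [0, 0])  # description -> [relevant, total]
    for x, relevant in sample:
        counts[x][0] += int(relevant)
        counts[x][1] += 1
    return {x: rel / total for x, (rel, total) in counts.items()}


# Hypothetical learning sample of (description, relevance judgement) pairs.
learning_sample = [((1, False), False), ((1, False), True), ((2, True), True)]
model = fit(learning_sample)

# A new document and term are mapped onto a description seen in the sample.
new_doc = {"title": ["probabilistic", "indexing"],
           "body": ["indexing", "of", "documents", "indexing"]}
weight = model.get(describe("indexing", new_doc))
```

Because the estimate is attached to the description rather than to the specific term or document, it remains usable for terms and documents that never occurred in the learning sample.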
In the description step, term-document pairs (t_i, d_m) are mapped onto so-called relevance descriptions
x(t_i, d_m). The elements of x contain values of features of t_i, d_m and their
relationship, e.g.