NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
N. Fuhr
U. Pfeifer
C. Bremkamp
M. Pollmann
National Institute of Standards and Technology
D. K. Harman
Probabilistic Learning Approaches for Indexing and Retrieval with the
TREC-2 Collection
Norbert Fuhr, Ulrich Pfeifer, Christoph Bremkamp, Michael Pollmann
University of Dortmund, Germany
Chris Buckley
Cornell University
Abstract
In this paper, we describe the application of probabilistic
models for indexing and retrieval with the TREC-2 collection.
This database consists of about a million documents
(2 gigabytes of data) and 100 queries (50 routing and
50 adhoc topics). For document indexing, we use a
description-oriented approach which exploits relevance
feedback data in order to produce a probabilistic indexing
with single terms as well as with phrases. With the adhoc
queries, we present a new query term weighting method
based on a training sample of other queries. For the
routing queries, the RPI model is applied which combines
probabilistic indexing with query term weighting based on
query-specific feedback data. The experimental results of
our approach show very good performance for both types
of queries.
1 Introduction
The good TREC-1 results of our group described in
[Fuhr & Buckley 93] have confirmed the general concept
of probabilistic retrieval as a learning approach. In this
paper, we describe some improvements of the indexing
and retrieval procedures. We first give a brief outline of
the document indexing procedure, which is based on
description-oriented indexing in combination with
polynomial regression. Section 3 describes query term
weighting for adhoc queries, where we have developed a
new learning method based on a training sample of other
queries and corresponding relevance judgements.
Section 4 presents the construction of the routing
queries, which is based on the probabilistic RPI retrieval
model for query-specific feedback data. In the final
conclusions, we suggest some further improvements of
our method.
2 Document indexing
The task of probabilistic document indexing can be
described as follows (see [Fuhr & Buckley 91] for more
details): Let $d_m$ denote a document, $t_i$ a term, and $R$
the fact that a query-document pair is judged relevant.
Then $P(R \mid t_i, d_m)$ denotes the probability that
document $d_m$ will be judged relevant w.r.t. an arbitrary
query that contains term $t_i$. Since these weights can
hardly be estimated directly, we use the
description-oriented indexing approach. Here
term-document pairs $(t_i, d_m)$ are mapped onto so-called
relevance descriptions $\vec{x}(t_i, d_m)$. The elements
$x_j$ of the relevance description contain values of
features of $t_i$, $d_m$ and their relationship, e.g.
tf = within-document frequency (wdf) of $t_i$,
logidf = log(inverse document frequency),
lognumterms = log(number of different terms in $d_m$),
imaxtf = 1/(maximum wdf of a term in $d_m$),
is_single = 1 if the term is a single word, = 0 otherwise,
is_phrase = 1 if the term is a phrase, = 0 otherwise.
(As phrases, we considered all adjacent non-stopwords
that occurred at least 25 times in the D1+D2 (training)
document set.)
Based on these relevance descriptions, we estimate
the probability $P(R \mid \vec{x}(t_i, d_m))$ that an
arbitrary term-document pair having relevance description
$\vec{x}$ will be involved in a relevant query-document
relationship. This probability is estimated by a so-called
indexing function $u(\vec{x})$. Different regression
methods or probabilistic classification algorithms can
serve as indexing function. For our retrieval runs
submitted to TREC-2, we used polynomial regression for
developing an indexing function of the form
$$u(\vec{x}) = \vec{b}^T \cdot \vec{v}(\vec{x}) \qquad (1)$$
where $\vec{b}$ is the vector of regression coefficients
and the components of $\vec{v}(\vec{x})$ are products of
elements of $\vec{x}$.
The indexing function actually used has the form