SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
chapter
N. Fuhr
U. Pfeifer
C. Bremkamp
M. Pollmann
National Institute of Standards and Technology
D. K. Harman
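\[
\begin{aligned}
b_0 &+ b_1 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf} \\
    &+ b_2 \cdot \mathit{is\_single} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
    &+ b_3 \cdot \mathit{is\_single} \cdot \mathit{logidf} \\
    &+ b_4 \cdot \mathit{lognumterms} \cdot \mathit{imaxtf} \\
    &+ b_5 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{logidf} \cdot \mathit{imaxtf} \\
    &+ b_6 \cdot \mathit{is\_phrase} \cdot \mathit{tf} \cdot \mathit{imaxtf} \\
    &+ b_7 \cdot \mathit{is\_phrase} \cdot \mathit{logidf}.
\end{aligned}
\]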
The coefficient vector $\vec{b}$ is computed based on a training
sample of query-document pairs with relevance judgements.
Since polynomial functions may yield results outside the
interval [0,1], these values were mapped onto the corre-
sponding boundaries of this interval.
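
The mapping onto the interval boundaries can be illustrated with a minimal sketch. The function names, coefficients, and sample values below are hypothetical and serve only as illustration; the sketch assumes the indexing function is linear in its coefficients, with the feature products already computed by the caller.

```python
from typing import Sequence


def clip_to_unit_interval(value: float) -> float:
    """Map a raw polynomial output onto [0, 1] by clamping at the boundaries."""
    return min(1.0, max(0.0, value))


def polynomial_weight(b: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate the dot product b . x and clamp the result to [0, 1].

    `x` holds the feature products of the polynomial (x[0] == 1.0 for the
    constant term); which products are used is left to the caller.
    """
    raw = sum(bj * xj for bj, xj in zip(b, x))
    return clip_to_unit_interval(raw)


# Hypothetical raw outputs, two of them outside the admissible interval.
raw_outputs = [-0.12, 0.0, 0.37, 1.0, 1.45]
clipped = [clip_to_unit_interval(v) for v in raw_outputs]
```

Values inside the interval pass through unchanged; only the out-of-range estimates are moved to 0 or 1.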
For each phrase occurring in a document, indexing
weights for the phrase as well as for its two components
(as single words) were computed.
There are two major problems with this approach which
we are investigating currently:
1. Which factors should be used for defining the in-
dexing functions? We are developing a tool that
supports a statistical analysis of single factors for
this purpose.
2. What is the best type of indexing function? Pre-
vious experiments have suggested that regression
methods outperform other probabilistic classifica-
tion methods. As a reasonable alternative to poly-
nomial regression, logistic regression seems to offer
some advantages (see also [Fuhr & Pfeifer 94]). As
a major benefit, logistic functions yield only values
between 0 and 1, so there is no problem with outliers.
We are performing experiments with logistic
regression and comparing the results to those based
on polynomial regression.
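
To illustrate the second point, the following sketch contrasts a linear score with its logistic transform. The coefficients and inputs are hypothetical; the sketch only demonstrates the bounded range of the logistic function and is not the regression procedure used in the experiments.

```python
import math
from typing import Sequence


def logistic(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_score(b: Sequence[float], x: Sequence[float]) -> float:
    """Unbounded linear score b . x, as a polynomial function would produce."""
    return sum(bj * xj for bj, xj in zip(b, x))


b = [-2.0, 3.0]  # hypothetical coefficients (constant term, one feature)
inputs = [-5.0, 0.0, 1.0, 5.0]
linear = [linear_score(b, [1.0, x1]) for x1 in inputs]
bounded = [logistic(s) for s in linear]
```

The linear scores range far outside [0, 1], whereas every logistic value stays within the interval without any clipping.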
3 Query term weighting for ad-hoc queries
3.1 Theoretical background
The basis of our query term weighting scheme for ad-hoc
queries is the linear utility-theoretic retrieval function
described in [Wong & Yao 89]. Let $q_k^T$ denote the set
of terms occurring in the query, and $u_{im}$ the indexing
weight $u(\vec{x}(t_i, d_m))$ (with $u_{im} = 0$ for terms $t_i$ not
occurring in $d_m$). If $c_{ik}$ gives the utility of term $t_i$ for the
actual query $q_k$, then the utility of document $d_m$ w.r.t.
query $q_k$ can be computed by the retrieval function
\[
\varrho(q_k, d_m) = \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \tag{2}
\]
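
A minimal sketch of this retrieval function, read as the sum over all query terms of utility weight times indexing weight, is given below. The dictionary representation, the function name, and all numbers are assumptions made for illustration.

```python
from typing import Dict


def retrieval_status_value(
    query_weights: Dict[str, float],
    doc_weights: Dict[str, float],
) -> float:
    """Sum c_ik * u_im over the query terms.

    Terms that do not occur in the document contribute u_im = 0.
    """
    return sum(
        c_ik * doc_weights.get(term, 0.0)
        for term, c_ik in query_weights.items()
    )


# Hypothetical utility weights c_ik and indexing weights u_im.
query = {"probabilistic": 1.0, "indexing": 2.0, "retrieval": 1.0}
document = {"indexing": 0.4, "retrieval": 0.25, "regression": 0.7}
score = retrieval_status_value(query, document)
```

Only the terms shared by query and document contribute; the document term "regression" is ignored because it carries no query weight.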
For the estimation of the utility weights Cik, we applied
two different methods.
As a heuristic approach, we used tf weights (the number
of occurrences of the term $t_i$ in the query), which
had shown good results in the experiments described in
[Fuhr & Buckley 91].
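
The tf heuristic amounts to counting term occurrences in the query. A short sketch follows; the function name and the sample query are hypothetical, and the query is assumed to be already tokenized (and stemmed, if applicable).

```python
from collections import Counter
from typing import Dict, Iterable


def tf_query_weights(query_terms: Iterable[str]) -> Dict[str, int]:
    """Return the number of occurrences of each term in the query."""
    return dict(Counter(query_terms))


terms = ["text", "retrieval", "text", "indexing"]
weights = tf_query_weights(terms)
```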
As a second method, we applied linear regression to this
problem. Based on the concept of polynomial retrieval
functions as described in [Fuhr 89b], one can estimate
the probability of relevance of $q_k$ w.r.t. $d_m$ by the formula
\[
P(R \mid q_k, d_m) \approx \sum_{t_i \in q_k^T} c_{ik} \cdot u_{im}. \tag{3}
\]
If we had relevance feedback data for the specific query
(as is the case for the routing queries), this function
could be used directly for regression. For the ad-hoc
queries, however, we have only feedback information
about other queries. For this reason, we regard query
features instead of specific queries. This can be done
by considering for each query term the same features as
described before in the context of document indexing.
Assume that we have a set of features $\{f_0, f_1, \ldots, f_l\}$
and that $x_{ji}$ denotes the value of feature $f_j$ for term
$t_i$. Then we assume that query term weight $c_{ik}$ can be
estimated by linear regression according to the formula
\[
c_{ik} = \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji}. \tag{4}
\]
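
Reading formula 4 as the feature-weighted sum divided by the number of query terms, a query term weight can be computed as in the following sketch. The function name, coefficients, and feature values are hypothetical.

```python
from typing import Sequence


def query_term_weight(
    a: Sequence[float],
    x_i: Sequence[float],
    num_query_terms: int,
) -> float:
    """Compute c_ik = (1 / |q_k^T|) * sum_j a_j * x_ji for one query term."""
    if len(a) != len(x_i):
        raise ValueError("coefficient and feature vectors differ in length")
    if num_query_terms <= 0:
        raise ValueError("a query must contain at least one term")
    return sum(aj * xj for aj, xj in zip(a, x_i)) / num_query_terms


a = [0.5, 1.0, -0.25]   # hypothetical regression coefficients a_0..a_2
x_i = [1.0, 2.0, 4.0]   # hypothetical feature values x_0i..x_2i of one term
c_ik = query_term_weight(a, x_i, num_query_terms=4)
```

With these numbers the weighted sum is 1.5, so the weight for a four-term query is 0.375; the same term in an eight-term query would receive half of that.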
Here the factor $1/|q_k^T|$ serves to normalize
across different queries, since queries with a larger
number of terms tend to yield higher retrieval status
values with formula 2. The factors $a_j$ are the coefficients
that are to be derived by means of regression. Now we
have the problem that regression cannot be applied to
eqn 4, since we do not observe $c_{ik}$ values directly.
Instead, we observe relevance judgements. This leads us
back to the polynomial retrieval function 3, where we
substitute eqn 4 for $c_{ik}$:
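\[
\begin{aligned}
P(R \mid q_k, d_m) &\approx \sum_{t_i \in q_k^T} \left( \frac{1}{|q_k^T|} \sum_{j=0}^{l} a_j x_{ji} \right) u_{im} \\
&= \sum_{j=0}^{l} a_j \, \frac{1}{|q_k^T|} \sum_{t_i \in q_k^T} x_{ji} u_{im} \\
&= \sum_{j=0}^{l} a_j y_j
\end{aligned}
\tag{5}
\]
with
\[
y_j = \frac{1}{|q_k^T|} \sum_{t_i \in q_k^T} x_{ji} u_{im}. \tag{6}
\]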
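
The rearrangement can be checked numerically: summing per-term weights times indexing weights gives the same value as first aggregating the features into the values y_j and then applying the coefficients a_j. The sketch below uses hypothetical names and numbers and assumes each query term is given as a pair of its feature values and its indexing weight in the document.

```python
from typing import List, Sequence, Tuple

# One query term: (feature values x_0i..x_li, indexing weight u_im).
Term = Tuple[Sequence[float], float]


def direct_estimate(a: Sequence[float], terms: Sequence[Term]) -> float:
    """Sum over terms of c_ik * u_im, with c_ik = (1/|q|) * sum_j a_j x_ji."""
    n = len(terms)
    total = 0.0
    for x_i, u_im in terms:
        c_ik = sum(aj * xj for aj, xj in zip(a, x_i)) / n
        total += c_ik * u_im
    return total


def aggregate_features(terms: Sequence[Term], num_features: int) -> List[float]:
    """Compute y_j = (1/|q|) * sum_i x_ji * u_im for every feature j."""
    n = len(terms)
    return [
        sum(x_i[j] * u_im for x_i, u_im in terms) / n
        for j in range(num_features)
    ]


def regression_form(a: Sequence[float], terms: Sequence[Term]) -> float:
    """Evaluate sum_j a_j * y_j, the form that is linear in the coefficients."""
    y = aggregate_features(terms, len(a))
    return sum(aj * yj for aj, yj in zip(a, y))


a = [0.5, 1.0]                                   # hypothetical coefficients
terms = [([1.0, 2.0], 0.5), ([1.0, 0.0], 0.25)]  # hypothetical query terms
y = aggregate_features(terms, len(a))
```

Because the second form is linear in the coefficients, each query-document pair contributes one vector of y values together with its relevance judgement, which is exactly the input that linear regression needs.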
Equation 5 shows that we can apply linear regression
of the form $P(R \mid q_k, d_m) \approx \vec{a}^T \cdot \vec{y}$ to a training sample