NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman (Ed.), National Institute of Standards and Technology

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
N. Fuhr, U. Pfeifer, C. Bremkamp, M. Pollmann
4.3 Official runs
Two different runs were submitted for the routing
queries, both based on the RPI model.
Run dortP1 uses the same document indexing function
as for the ad-hoc queries. Query terms were weighted
according to the RPI formula. In addition, each query
was expanded by 20 single words. Phrases were not
downweighted.
Run dortV1 is based on ltc document indexing; here,
no query expansion took place.
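For illustration, the 20-term expansion used for run dortP1 can be sketched as follows. The candidate terms and their scores are invented placeholders; the actual expansion is based on the RPI weights, which are not reproduced here:

```python
# Hedged sketch of query expansion: add the k best-scoring candidate
# terms (e.g. drawn from relevant feedback documents) to the query.
# The scores below are illustrative placeholders, not RPI weights.

def expand_query(query_terms, candidate_scores, k=20):
    """Return the query enlarged by the k best-scoring new terms."""
    candidates = [(score, term) for term, score in candidate_scores.items()
                  if term not in query_terms]
    best = sorted(candidates, reverse=True)[:k]
    return set(query_terms) | {term for _, term in best}

scores = {"retrieval": 0.9, "probabilistic": 0.7,
          "indexing": 0.6, "trec": 0.2}
print(sorted(expand_query({"retrieval"}, scores, k=2)))
# → ['indexing', 'probabilistic', 'retrieval']
```

With k=2, only the two highest-scoring terms not already present ("probabilistic" and "indexing") are added.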
                       dortV1       dortP1
document indexing      ltc          lsp
query expansion        none         20 terms

average precision:
Prec. Avg.             0.3516       0.3800

query-wise comparison with median:
Prec. Avg.             38:10        46:4
Prec. @ 100 docs       31:11        40:5
Prec. @ 1000 docs      32:9         37:7

best/worst results:
Prec. Avg.             1/0          4(2)/0
Prec. @ 100 docs       3(3)/1(1)    7(5)/1(1)
Prec. @ 1000 docs      6(2)/0(1)    10(2)/0(1)

dortV1 vs. dortP1:
Prec. Avg.             10:39
Prec. @ 100 docs       9:27
Prec. @ 1000 docs      7:33

Table 10: Results for routing queries
Table 10 shows the results for the two runs; the recall-
precision curves are given in figure 2. Again, the results
confirm our expectation that LSP indexing and query
expansion yield better results.
5 Conclusions and outlook
The experiments described in this paper have shown
that probabilistic learning approaches can be applied
successfully to different types of indexing and retrieval.
For the ad-hoc queries, there still seems to be room for
further improvement in the low recall range. In order to
increase precision, a passage-wise comparison of query
and document text should be performed. For this purpose,
polynomial retrieval functions could be applied.
In the case of the routing queries, we first have to
investigate methods for parameter estimation in combination
with query expansion. However, with the large number
of feedback documents given for this task, other types
of retrieval models may be more suitable, e.g.
query-specific polynomial retrieval functions.
Finally, it should be emphasized that we still use rather
simple forms of text analysis. Since our methods are
flexible enough to work with more sophisticated analysis
procedures, this combination seems to be a prospective
area of research.
A Operational details of runs

A.1 Basic Algorithms

Algorithm A, which determines the coefficient vector a for
the ad-hoc query term weights, can be given as follows:

Algorithm A
1 For each query-document pair (q_k, d_m) ∈
  (Q1 ∪ Q2) × D_s, with D_s being a sample from
  (D1 ∪ D2), do
  1.1 Determine the relevance value r_km of the
      document d_m with respect to the query q_k.
  1.2 For each term t_i occurring in q_k do
      1.2.1 Determine the feature vector x_i and the
            indexing weight u_im of the term t_i
            w.r.t. the document d_m.
  1.3 For each feature j of the feature vectors x,
      compute the value of y_j, looping over the
      terms of the query.
  1.4 Add the vector x and the relevance value r_km
      to the least squares matrix.
2 Solve the least squares matrix to find the
  coefficient vector a.
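As a concrete illustration, Algorithm A amounts to an ordinary least-squares fit of the coefficient vector a. The following is a minimal pure-Python sketch; the feature rows and relevance values are invented placeholders, and the RPI-specific feature extraction of steps 1.1–1.3 is not reproduced:

```python
# Sketch of Algorithm A: collect one (feature vector, relevance value)
# row per query-document pair, then solve the least-squares system for
# the coefficient vector a.  All data below is illustrative.

def solve_normal_equations(rows, targets):
    """Solve min_a ||X a - r||^2 via the normal equations X^T X a = X^T r,
    using Gaussian elimination (adequate for small coefficient vectors)."""
    n = len(rows[0])
    A = [[sum(x[i] * x[j] for x in rows) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * r for x, r in zip(rows, targets)) for i in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for c in range(col, n):
                A[k][c] -= f * A[col][c]
            b[k] -= f * b[col]
    a = [0.0] * n                             # back substitution
    for i in reversed(range(n)):
        a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, n))) / A[i][i]
    return a

# Steps 1.1-1.4: one feature row and one relevance value per (q_k, d_m).
feature_rows = [[1.0, 0.2, 0.1],
                [1.0, 0.8, 0.5],
                [1.0, 0.5, 0.9],
                [1.0, 0.1, 0.7]]
relevance = [0.0, 1.0, 1.0, 0.0]

# Step 2: solve the accumulated least squares matrix for a.
a = solve_normal_equations(feature_rows, relevance)
print(a)
```

Solving the normal equations directly is adequate here because the coefficient vector is small; for larger feature sets, a QR-based solver would be numerically preferable.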
Algorithm B, which determines the coefficient vector b for
the document indexing, is sketched here:

Algorithm B
1 Index D1 ∪ D2 (the learning document set) and
  Q1 ∪ Q2 (the learning query set).
2 For each document d ∈ D1 ∪ D2
  2.1 For each query q ∈ Q1 ∪ Q2
      2.1.1 Determine the relevance value r of
            d with respect to q.
      2.1.2 For each term t in common between
            q^T (the set of query terms) and
            d^T (the set of document terms)
            2.1.2.1 Find the values of the elements
                    of the relevance description
                    involved in this run and add
                    these values plus the relevance
                    information to the least squares
                    matrix being constructed.
3 Solve the least squares matrix to find the
  coefficient vector b.
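The nested data-collection loops of Algorithm B can be sketched as follows. The documents, queries, relevance judgments, and the relevance-description function are all invented placeholders (the actual relevance description is run-specific); the final solve for b then proceeds on the collected matrix exactly as step 3 states:

```python
# Sketch of Algorithm B's data collection: for every document-query
# pair and every term they have in common, append that term's
# relevance description plus the pair's relevance value to the
# growing least-squares system.  All data here is illustrative.

docs = {"d1": {"probabilistic", "retrieval"},
        "d2": {"indexing", "retrieval"}}
queries = {"q1": {"retrieval", "indexing"}}
relevance = {("d1", "q1"): 1.0, ("d2", "q1"): 0.0}

def relevance_description(term, doc, query):
    # Placeholder features, e.g. within-document frequency, idf, ...
    # (what step 2.1.2.1 would actually compute for the run).
    return [1.0, len(term) / 10.0]

rows, targets = [], []
for d, d_terms in docs.items():                  # step 2
    for q, q_terms in queries.items():           # step 2.1
        r = relevance[(d, q)]                    # step 2.1.1
        for t in q_terms & d_terms:              # step 2.1.2
            rows.append(relevance_description(t, d, q))  # step 2.1.2.1
            targets.append(r)

print(len(rows), len(targets))
# → 3 3  (one common term for d1/q1, two for d2/q1)
```

Each row of the resulting matrix corresponds to one (query term, document term) match, so the system grows with the number of matches rather than with the number of document-query pairs.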