NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
N. Fuhr
U. Pfeifer
C. Bremkamp
M. Pollmann
The algorithm C to index a document set D can now
be given as:
Algorithm C
1 For each document d ∈ D
1.1 For each term t ∈ dT (set of terms of d)
1.1.1 Find the values of the relevance description x(t, d) involved in the run.
1.1.2 Give t the weight b · v(x(t, d)).
1.2 Add d to the inverted file.
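Algorithm C can be sketched as follows. The function and variable names are hypothetical, and the linear polynomial structure v(x) is an assumption for illustration; the actual structure used in the runs may be richer.

```python
# Sketch of Algorithm C: index documents by scoring each term's relevance
# description x(t, d) with the learned coefficient vector b.
from collections import defaultdict

def polynomial_structure(x):
    """v(x): a constant term plus the raw features (a linear polynomial
    structure; assumed here, the runs may use a richer one)."""
    return [1.0] + list(x)

def index_collection(docs, relevance_description, b):
    """docs: {doc_id: set_of_terms};
    relevance_description(t, doc_id) -> feature vector x(t, d).
    Returns an inverted file {term: [(doc_id, weight), ...]}."""
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for t in terms:
            x = relevance_description(t, doc_id)
            v = polynomial_structure(x)
            # Step 1.1.2: weight is the scalar product b . v(x(t, d))
            weight = sum(bi * vi for bi, vi in zip(b, v))
            inverted[t].append((doc_id, weight))
    return inverted
```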
A.2 Ad-hoc runs
Algorithm D is used for indexing and retrieval for
the ad-hoc runs. Steps numbered with a trailing "A"
apply only to run dortq2, steps with a trailing "B" only
to run dortL2.
Algorithm D
1 Run algorithm B to determine the coefficient vector b for document indexing.
1A Run algorithm A to determine the coefficient vector a for query indexing.
2 Call algorithm C for document set D1 ∪ D2.
3 For each query qk ∈ Q3 do
3.1 For each term ti occurring in qk do
3.1.1A Determine the feature vector xik and compute the query term weight cik = a · v(xik).
3.1.1B Weight ti w.r.t. qk (test query set) with tf weights (nnn variant). Phrases were downweighted by multiplying their weights by 0.15.
3.2 Run an inner product inverted file similarity match of ck against the inverted file formed in step 2, retrieving the top 1000 documents.
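Step 3.2's inner product match over an inverted file can be sketched as below. The names are illustrative, and the inverted file is assumed to hold (doc_id, weight) postings as produced by algorithm C.

```python
# Sketch of the inner-product inverted-file match (step 3.2): accumulate
# per-document scores term by term, then keep the top-ranked documents.
import heapq
from collections import defaultdict

def inner_product_match(query_weights, inverted, top_n=1000):
    """query_weights: {term: c_ik}; inverted: {term: [(doc_id, weight), ...]}.
    Scores each document by the inner product of query and document term
    weights and returns the top_n (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for term, c in query_weights.items():
        for doc_id, w in inverted.get(term, ()):
            scores[doc_id] += c * w
    return heapq.nlargest(top_n, scores.items(), key=lambda kv: kv[1])
```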
A.3 Routing runs
Algorithm E is used for indexing and retrieval for the
routing runs. Steps numbered with a trailing "A" apply
only to run dortP1, steps with a trailing "B" only to run
dortV1.
Algorithm E
1A Index query set Q2 and document set D1 ∪ D2 with tf·idf weights.
1B Index query set Q2 and document set D1 ∪ D2 by calling algorithm C.
2 For each query q ∈ Q2
2.1 For each term t ∈ qT (set of query terms)
2.1.1 Reweight term t using the RPI relevance weighting formula and the relevance information supplied.
3A Index document set D3 by calling algorithm C.
3B Index document set D3 with tf·idf weights. Note that the collection frequency information used was derived from occurrences in D1 ∪ D2 only (in actual routing the collection frequencies within D3 would not be known).
4 Run the reweighted queries of Q2 (step 2) against the inverted file (step 3), returning the top 1000 documents for each query.
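The relevance reweighting in step 2 can be sketched as follows. The RPI formula itself is defined in Fuhr (1989a) and is not reproduced here; as a plainly labeled stand-in, this sketch uses the well-known Robertson/Sparck Jones relevance weight, which plays the same role of deriving a query term weight from relevance feedback counts.

```python
# Sketch of step 2 of Algorithm E with a stand-in relevance weight
# (Robertson/Sparck Jones, NOT the RPI formula used in the actual runs).
from math import log

def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones relevance weight.
    r: relevant documents containing the term, R: relevant documents,
    n: documents containing the term, N: collection size."""
    return log(((r + 0.5) * (N - n - R + r + 0.5)) /
               ((n - r + 0.5) * (R - r + 0.5)))

def reweight_query(query_terms, rel_info, N):
    """Replace each query term weight by a relevance weight derived from
    the supplied relevance information. rel_info: {term: (r, R, n)}."""
    return {t: rsj_weight(*rel_info[t], N) for t in query_terms}
```

The reweighted query vectors are then matched against the inverted file exactly as in the ad-hoc case (step 4).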
References
Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9(3), pages 223-248.
Fuhr, N.; Buckley, C. (1993). Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. In: Harman, D. (ed.): The First Text REtrieval Conference (TREC-1), pages 89-100. National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, Md. 20899.
Fuhr, N.; Pfeifer, U. (1994). Probabilistic Information Retrieval as Combination of Abstraction, Inductive Learning and Probabilistic Assumptions. ACM Transactions on Information Systems 12(1).
Fuhr, N. (1989a). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1), pages 55-72.
Fuhr, N. (1989b). Optimum Polynomial Retrieval Functions Based on the Probability Ranking Principle. ACM Transactions on Information Systems 7(3), pages 183-204.
Wong, S.; Yao, Y. (1989). A Probability Distribution Model for Information Retrieval. Information Processing and Management 25(1), pages 39-53.