SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
National Institute of Standards and Technology, D. K. Harman (ed.)

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
N. Fuhr, U. Pfeifer, C. Bremkamp, M. Pollmann

The algorithm C to index a document set D can now be given as:

Algorithm C
1 For each document d ∈ D
1.1 For each term t occurring in d
1.1.1 Find the values of the relevance description x(t, d) involved in the run.
1.1.2 Give t the weight b · v(x(t, d)).
1.2 Add d to the inverted file.

A.2 Ad-hoc runs

The algorithm D is used for indexing and retrieval for the ad-hoc runs. Steps numbered with a trailing "A" apply only to run dortq2, steps with a trailing "B" only to run dortL2.

Algorithm D
1 Run algorithm B to determine the coefficient vector b for document indexing.
1A Run algorithm A to determine the coefficient vector a for query indexing.
2 Call algorithm C for document set D1 ∪ D2.
3 For each query q_k of the test query set do
3.1 For each term t_i occurring in q_k do
3.1.1A Determine the feature vector x_ik and compute the query term weight c_ik as its scalar product with a.
3.1.1B Weight t_i w.r.t. q_k with tf weights (nnn variant). Phrases were downweighted by multiplying their weights by 0.15.
3.2 Run an inner-product inverted-file similarity match of c_k against the inverted file formed in step 2, retrieving the top 1000 documents.

A.3 Routing runs

Algorithm E is used for indexing and retrieval for the routing runs. Steps numbered with a trailing "A" apply only to run dortP1, steps with a trailing "B" only to run dortV1.

Algorithm E
1A Index query set Q2 and document set D1 ∪ D2 with tf·idf weights.
1B Index query set Q2 and document set D1 ∪ D2 by calling algorithm C.
2 For each query q ∈ Q2
2.1 For each term t ∈ q^T (the set of query terms)
2.1.1 Reweight term t using the RPI relevance weighting formula and the relevance information supplied.
3A Index document set D3 by calling algorithm C.
3B Index document set D3 with tf·idf weights. Note that the collection frequency information used was derived from occurrences in D1 ∪ D2 only (in actual routing, the collection frequencies within D3 would not be known).
4 Run the reweighted queries of Q2 (step 2) against the inverted file (step 3), returning the top 1000 documents for each query.
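The core of algorithms C and D — indexing weights formed as a scalar product b · v(x(t, d)) and an inner-product inverted-file match (step 3.2) — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature function passed as `features` stands in for the relevance description v(x(t, d)), and the coefficient vector b is simply given here (the paper obtains it from algorithm B).

```python
# Minimal sketch (assumptions): toy inverted file with inner-product matching.
# `features`, `index_documents`, and `retrieve` are illustrative names.
from collections import defaultdict

def index_documents(docs, b, features):
    """Build an inverted file term -> [(doc_id, weight)], where the
    indexing weight is the scalar product b . v(x(t, d))."""
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for t in set(terms):
            x = features(t, terms)  # stand-in for the relevance description
            weight = sum(bi * xi for bi, xi in zip(b, x))
            inverted[t].append((doc_id, weight))
    return inverted

def retrieve(query_weights, inverted, k=1000):
    """Inner-product similarity match of the query term weights c
    against the inverted file; return the top-k documents."""
    scores = defaultdict(float)
    for t, c in query_weights.items():
        for doc_id, w in inverted.get(t, []):
            scores[doc_id] += c * w
    return sorted(scores.items(), key=lambda s: -s[1])[:k]
```

For example, indexing two toy documents with a two-component feature vector (a constant and a relative term frequency) and matching the single-term query weight vector {"a": 1.0} returns only the document containing "a", scored by its indexing weight.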
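The note under step 3B — collection frequency information derived from D1 ∪ D2 only, since in actual routing the frequencies within D3 are unknown — can be illustrated with a small tf·idf sketch. The function names and the fallback document frequency of 1 for terms unseen in training are assumptions for this example, not details from the paper.

```python
# Sketch (assumptions): tf.idf weighting for the routing setting, where idf
# is computed from document frequencies in the training collection D1 u D2
# and then applied unchanged to unseen routing documents from D3.
import math
from collections import Counter

def document_frequencies(train_docs):
    """Document frequencies over the training collection only (D1 u D2)."""
    df = Counter()
    for terms in train_docs:
        df.update(set(terms))
    return df, len(train_docs)

def weight_new_doc(terms, df, n_train):
    """tf.idf weights for a routing document from D3, using training df;
    terms unseen in training fall back to a df of 1 (an assumption)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_train / df.get(t, 1)) for t in tf}
```

With two training documents, a term occurring in both gets idf log(2/2) = 0, so its weight in any D3 document is zero, while a term unseen in training is weighted as if it occurred in one training document — whatever its true frequency within D3.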