SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
chapter
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
The routing experiment was treated almost exactly as a normal relevance feedback experimental
run. The overall procedure was:
1. Index query set Q1 and document set D1 with tf·idf weights.
2. For each query q ∈ Q1:
   2.1 For each term t ∈ q^T (the set of query terms):
       2.1.1 Reweight term t using the RPI relevance weighting formula and
       the (fragmentary) relevance information supplied.
3. Index document set D2 with tf·idf weights. Note that the collection
   frequency information used in the idf weight was derived from occurrences
   in D1 only (in actual routing the collection frequencies within D2 would
   not be known).
4. Run the reweighted queries of Q1 (step 2) against the inverted file (step 3),
   returning the top 200 documents for each query.
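The four steps above can be sketched in Python. This is a minimal sketch, not the actual system used: the exact RPI reweighting formula is given in Fuhr (1989) and is replaced here by a generic Robertson/Sparck Jones-style relevance weight as a stand-in, and all function names are illustrative.

```python
import math
from collections import Counter

def build_idf(docs):
    """idf from document frequencies in the training collection D1 only
    (step 3 reuses these values; D2 frequencies are assumed unknown)."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(N / df[t]) for t in df}

def tfidf_vector(terms, idf):
    """tf·idf weights for a bag of terms, using externally supplied idf."""
    tf = Counter(terms)
    return {t: (1 + math.log(n)) * idf.get(t, 0.0) for t, n in tf.items()}

def reweight(query_weights, relevant_docs, all_docs):
    """Stand-in for the RPI relevance reweighting (step 2): a Robertson/
    Sparck Jones-style weight computed from the fragmentary judgements."""
    N, R = len(all_docs), len(relevant_docs)
    new = {}
    for t in query_weights:
        n = sum(1 for d in all_docs if t in d)       # docs containing t
        r = sum(1 for d in relevant_docs if t in d)  # judged-relevant docs with t
        # 0.5 smoothing avoids log of zero for unseen combinations
        rsj = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                       ((n - r + 0.5) * (R - r + 0.5)))
        new[t] = max(rsj, 0.0)
    return new

def rank(query_weights, docs, idf, k=200):
    """Step 4: score D2 documents by inner product, return the top k."""
    scored = []
    for i, d in enumerate(docs):
        dv = tfidf_vector(d, idf)
        scored.append((sum(w * dv.get(t, 0.0)
                           for t, w in query_weights.items()), i))
    scored.sort(reverse=True)
    return scored[:k]
```

Note that `rank` deliberately takes the D1-derived `idf` when indexing D2, mirroring the constraint in step 3.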
This approach differs from true routing in that
A. All documents of D2 were indexed at once instead of indexing each document individually and
comparing it against each query in turn. Thus the indexing/retrieval times obtained for the above
algorithm are not meaningful.
B. True routing is really a binary decision, most often implemented as a similarity threshold;
retrieving the top 200 documents (in ranked order) would not normally be done. However, this
difference from true routing is required for evaluation purposes, and was thus required for TREC.
Note that the approach was completely automatic, with the queries and documents treated as flat text
(no structure). Unlike the ad-hoc runs above, the queries were indexed using the words
from all topic sections.
It is unknown what effect the fragmentary relevance information had on the query reformulation. The
strength of the effect depends on whether the documents ranked highest under the original tf·idf
weights had been judged and included in the fragmentary judgements.
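One way to probe this question (a hypothetical analysis, not one reported in the paper) would be to measure what fraction of the top-ranked tf·idf documents actually carry a judgement; a low overlap would suggest the fragmentary judgements gave the reweighting little to work with. A minimal sketch, assuming a ranking of `(score, doc_id)` pairs and a set of judged document ids:

```python
def judged_overlap(ranking, judged_ids, k=20):
    """Fraction of the top-k ranked documents that have a relevance
    judgement. `ranking` is a list of (score, doc_id) pairs sorted by
    decreasing score; `judged_ids` is the set of judged doc ids."""
    top = [doc_id for _, doc_id in ranking[:k]]
    return sum(1 for d in top if d in judged_ids) / len(top)
```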
Step 1 took 3.0 hours, step 2 about 1304 seconds, step 3 about 1.9 hours, and step 4 about 312 seconds.
References
Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM
Transactions on Information Systems 9(3), pages 223-248.
Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and
Management 25(1), pages 55-72.
Fuhr, N. (1992). Integration of Probabilistic Fact and Text Retrieval. In: Belkin, N.; Ingwersen, P.;
Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 211-222. ACM, New York.
Maron, M.; Kuhns, J. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval.
Journal of the ACM 7, pages 216-244.
Pfeifer, U. (1990). Development of Log-Linear and Linear-Iterative Indexing Functions (in German).
Diploma thesis, TH Darmstadt, FB Informatik, Datenverwaltungssysteme II.
Pfeifer, U. (1991). Entwicklung linear iterativer und logistischer Indexierungsfunktionen. In: Fuhr,
N. (ed.): Information Retrieval, pages 23-37. Springer, Berlin et al.
Robertson, S. (1977). The Probability Ranking Principle in IR. Journal of Documentation 33,
pages 294-304.