SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
chapter
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
The routing experiment was treated almost exactly as a normal relevance feedback experimental
run. The overall procedure was:
1. Index query set Q1 and document set D1 with tf·idf weights.
2. For each query q ∈ Q1:
   2.1 For each term t ∈ q^T (the set of query terms):
       2.1.1 Reweight term t using the RPI relevance weighting formula and
       the (fragmentary) relevance information supplied.
3. Index document set D2 with tf·idf weights. Note that the collection
   frequency information used in the idf weight was derived from occurrences
   in D1 only (in actual routing the collection frequencies within D2 would
   not be known).
4. Run the reweighted queries of Q1 (step 2) against the inverted file (step 3),
   returning the top 200 documents for each query.
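The four steps above can be sketched in Python. This is a minimal sketch, not the actual system used: the exact RPI reweighting formula is given in Fuhr (1989) and is replaced here by a generic Robertson/Sparck Jones-style relevance weight as a stand-in, and all function names are illustrative.

```python
import math
from collections import Counter

def build_idf(docs):
    """idf from document frequencies in the training collection D1 only
    (step 3 reuses these values; D2 frequencies are assumed unknown)."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(N / df[t]) for t in df}

def tfidf_vector(terms, idf):
    """tf·idf weights for a bag of terms, using externally supplied idf."""
    tf = Counter(terms)
    return {t: (1 + math.log(n)) * idf.get(t, 0.0) for t, n in tf.items()}

def reweight(query_weights, relevant_docs, all_docs):
    """Stand-in for the RPI relevance reweighting (step 2): a Robertson/
    Sparck Jones-style weight computed from the fragmentary judgements."""
    N, R = len(all_docs), len(relevant_docs)
    new = {}
    for t in query_weights:
        n = sum(1 for d in all_docs if t in d)       # docs containing t
        r = sum(1 for d in relevant_docs if t in d)  # judged-relevant docs with t
        # 0.5 smoothing avoids log of zero for unseen combinations
        rsj = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                       ((n - r + 0.5) * (R - r + 0.5)))
        new[t] = max(rsj, 0.0)
    return new

def rank(query_weights, docs, idf, k=200):
    """Step 4: score D2 documents by inner product, return the top k."""
    scored = []
    for i, d in enumerate(docs):
        dv = tfidf_vector(d, idf)
        scored.append((sum(w * dv.get(t, 0.0)
                           for t, w in query_weights.items()), i))
    scored.sort(reverse=True)
    return scored[:k]
```

Note that `rank` deliberately takes the D1-derived `idf` when indexing D2, mirroring the constraint in step 3.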
This approach differs from true routing in that
A. All documents of D2 were indexed at once instead of indexing each document individually and
comparing it against each query in turn. Thus the indexing/retrieval times obtained for the above
algorithm are not meaningful.
B. True routing is really a binary decision, most often implemented as a similarity threshold;
retrieving the top 200 documents (in ranked order) would not normally be done. However, this
difference from true routing is required for evaluation purposes, and was thus required for TREC.
Note that the approach was completely automatic, with the queries and documents treated as flat text
(no structure). Unlike the ad-hoc runs above, the queries were indexed using the words
from all topic sections.
It is unknown what effect the fragmentary relevance information had on the query reformulation. The
strength of the effect depends on whether the documents ranked highest under the original tf·idf
weights had been judged and included in the fragmentary judgements.
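One way to probe this question (a hypothetical analysis, not one reported in the paper) would be to measure what fraction of the top-ranked tf·idf documents actually carry a judgement; a low overlap would suggest the fragmentary judgements gave the reweighting little to work with. A minimal sketch, assuming a ranking of `(score, doc_id)` pairs and a set of judged document ids:

```python
def judged_overlap(ranking, judged_ids, k=20):
    """Fraction of the top-k ranked documents that have a relevance
    judgement. `ranking` is a list of (score, doc_id) pairs sorted by
    decreasing score; `judged_ids` is the set of judged doc ids."""
    top = [doc_id for _, doc_id in ranking[:k]]
    return sum(1 for d in top if d in judged_ids) / len(top)
```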
Step 1 took 3.0 hours, step 2 about 1304 seconds, step 3 about 1.9 hours, and step 4 about 312 seconds.
References
Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM
Transactions on Information Systems 9(3), pages 223-248.
Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and
Management 25(1), pages 55-72.
Fuhr, N. (1992). Integration of Probabilistic Fact and Text Retrieval. In: Belkin, N.; Ingwersen, P.;
Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 211-222. ACM, New York.
Maron, M.; Kuhns, J. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval.
Journal of the ACM 7, pages 216-244.
Pfeifer, U. (1990). Development of Log-Linear and Linear-Iterative Indexing Functions (in German).
Diploma thesis, TH Darmstadt, FB Informatik, Datenverwaltungssysteme II.
Pfeifer, U. (1991). Entwicklung linear iterativer und logistischer Indexierungsfunktionen. In: Fuhr,
N. (ed.): Information Retrieval, pages 23-37. Springer, Berlin et al.
Robertson, S. (1977). The Probability Ranking Principle in IR. Journal of Documentation 33,
pages 294-304.