NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Edited by Donna K. Harman
probabilistic document indexing in order to estimate query term weights. Let p_ik denote the average
indexing weight of term t_i in the documents judged relevant w.r.t. query q_k, and r_ik the average
indexing weight of t_i in the nonrelevant documents. Now the query term weight is computed by the
formula

    c_{ik} = \frac{p_{ik}(1 - r_{ik})}{r_{ik}(1 - p_{ik})} - 1,                  (1)

and the RPI retrieval function yields

    \rho(q_k, d_m) = \sum_{t_i \in q_k^T \cap d_m^T} \log(c_{ik} u_{im} + 1),    (2)

where u_im denotes the indexing weight of term t_i in document d_m.
In our experiments, due to the lack of time, we used the standard SMART tf·idf document indexing
here [Salton & Buckley 88] (single words only) instead of the probabilistic indexing described above.
After an initial retrieval run with tf·idf weights for both queries and documents, relevance feedback
information was used for computing the feedback query term weights c_ik. Only the terms occurring in
the query were considered here, so no query expansion took place. In principle, the RPI formula can
also be applied in the case of query expansion; however, the additional terms should be treated
differently when estimating their query term weights. This problem has not yet been investigated for
the RPI model.
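To make the feedback weighting concrete, here is a minimal Python sketch of the query term weight c_ik and the RPI retrieval function (2). The function and variable names are ours, and the indexing weights p_ik, r_ik, and u_im are assumed to lie strictly between 0 and 1; this is an illustration, not the SMART implementation used in the experiments.

```python
import math

def rpi_query_term_weight(p_ik, r_ik):
    """Feedback query term weight c_ik.

    p_ik: average indexing weight of term t_i in the relevant documents,
    r_ik: average indexing weight of t_i in the nonrelevant documents.
    Both averages are assumed to lie strictly between 0 and 1.
    """
    return p_ik * (1.0 - r_ik) / (r_ik * (1.0 - p_ik)) - 1.0

def rpi_score(query_term_weights, doc_term_weights):
    """RPI retrieval function (2): sum of log(c_ik * u_im + 1) over the
    terms t_i occurring in both the query and the document."""
    return sum(
        math.log(c_ik * doc_term_weights[t] + 1.0)
        for t, c_ik in query_term_weights.items()
        if t in doc_term_weights
    )

# Example: p_ik = 0.6, r_ik = 0.2 gives c_ik = 0.48/0.08 - 1 = 5.
c = rpi_query_term_weight(0.6, 0.2)
score = rpi_score({"term": c}, {"term": 0.5})  # log(5 * 0.5 + 1)
```

Note that a term with p_ik = r_ik (equally frequent in relevant and nonrelevant documents) gets c_ik = 0 and thus contributes log(1) = 0 to the score, which is why the "- 1" term appears in the weight.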
# queries: 49

Query-wise comparison with median:
    11-pt Avg:          43/5
    Prec. @ 100 docs:   43/5
Best/worst results:
    11-pt Avg:          12/1
    Prec. @ 100 docs:   16/1

Table 2: Results for routing queries
Table 2 shows the results for the run fuhra2 with routing queries. It can be seen that this approach
works very well for almost every query. The single worst result is for topic #50, where we did not
retrieve any relevant document; this outcome is due to the fact that there was only one relevant
document in the training sample, which is obviously not sufficient for probabilistic parameter
estimation.
A Operational details of runs
A.1 Overall
All runs were done completely automatically, treating the text portions of both documents and queries
as fiat text without structure. This made everything much simpler, but was not ideal given the
complexity in both form and meaning of the queries. Recall-precision definitely suffered.
All actual indexing (as opposed to weighting) and retrieval was done with the Cornell SMART Version
11.0 system, using the standard SMART procedures (e.g., stopwords, stemming, inverted file retrieval).
All runs were made on a Sun Sparc 2 with 64 Mbytes of memory. All times reported are CPU time.
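As a rough illustration of those standard preprocessing steps, the following Python sketch builds a tiny inverted file with stopword removal and stemming. The stopword list and the crude suffix stripper are simplified stand-ins for SMART's real word lists and stemmer, not the actual SMART 11.0 code.

```python
# Tiny sample stopword list; SMART uses a much larger one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}

def stem(word):
    """Crude suffix stripping as a stand-in for SMART's stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each indexed term to the set of document ids containing it.

    docs: dict mapping document id -> raw text.
    """
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOPWORDS:
                continue  # drop stopwords before indexing
            index.setdefault(stem(token), set()).add(doc_id)
    return index

idx = build_inverted_file({1: "retrieval of documents", 2: "document indexing"})
# "documents" and "document" are conflated by the stemmer; "of" is dropped.
```

At retrieval time, such an inverted file lets the system touch only the documents that contain at least one query term, rather than scanning the whole collection.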