NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Edited by Donna K. Harman
probabilistic document indexing in order to estimate query term weights. Let p_ik denote the average
indexing weight of term t_i in the documents judged relevant w.r.t. query q_k, and r_ik the average
indexing weight of t_i in the nonrelevant documents. Now the query term weight is computed by the
formula

    c_{ik} = \frac{p_{ik}(1 - r_{ik})}{r_{ik}(1 - p_{ik})} - 1,                  (1)

and the RPI retrieval function yields

    \rho(q_k, d_m) = \sum_{t_i \in q_k^T \cap d_m^T} \log(c_{ik} u_{im} + 1),    (2)

where u_im denotes the indexing weight of term t_i in document d_m.
In our experiments, due to the lack of time, we used the standard SMART tf·idf document indexing
here [Salton & Buckley 88] (single words only) instead of the probabilistic indexing described above.
After an initial retrieval run with tf·idf weights for both queries and documents, relevance feedback
information was used for computing the feedback query term weights c_ik. Only the terms occurring in
the query were considered here, so no query expansion took place. In principle, the RPI formula can
also be applied in the case of query expansion; however, the additional terms should be treated
differently when estimating their query term weights. This problem has not yet been investigated for
the RPI model.
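To make the feedback weighting concrete, here is a minimal Python sketch of the query term weight c_ik and the RPI retrieval function (2). The function and variable names are ours, and the indexing weights p_ik, r_ik, and u_im are assumed to lie strictly between 0 and 1; this is an illustration, not the SMART implementation used in the experiments.

```python
import math

def rpi_query_term_weight(p_ik, r_ik):
    """Feedback query term weight c_ik.

    p_ik: average indexing weight of term t_i in the relevant documents,
    r_ik: average indexing weight of t_i in the nonrelevant documents.
    Both averages are assumed to lie strictly between 0 and 1.
    """
    return p_ik * (1.0 - r_ik) / (r_ik * (1.0 - p_ik)) - 1.0

def rpi_score(query_term_weights, doc_term_weights):
    """RPI retrieval function (2): sum of log(c_ik * u_im + 1) over the
    terms t_i occurring in both the query and the document."""
    return sum(
        math.log(c_ik * doc_term_weights[t] + 1.0)
        for t, c_ik in query_term_weights.items()
        if t in doc_term_weights
    )

# Example: p_ik = 0.6, r_ik = 0.2 gives c_ik = 0.48/0.08 - 1 = 5.
c = rpi_query_term_weight(0.6, 0.2)
score = rpi_score({"term": c}, {"term": 0.5})  # log(5 * 0.5 + 1)
```

Note that a term with p_ik = r_ik (equally frequent in relevant and nonrelevant documents) gets c_ik = 0 and thus contributes log(1) = 0 to the score, which is why the "- 1" term appears in the weight.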
# queries: 49

Query-wise comparison with median:
    11-pt Avg:          43/5
    Prec. @ 100 docs:   43/5
Best/worst results:
    11-pt Avg:          12/1
    Prec. @ 100 docs:   16/1

Table 2: Results for routing queries
Table 2 shows the results for the run fuhra2 with routing queries. It can be seen that this approach
works very well for almost every query. The single worst result is for topic #50, where we did not
retrieve any relevant document; this outcome is due to the fact that there was only one relevant
document in the training sample, which is obviously not sufficient for probabilistic parameter
estimation.
A Operational details of runs
A.1 Overall
All runs were done completely automatically, treating the text portions of both documents and queries
as fiat text without structure. This made everything much simpler, but was not ideal given the
complexity in both form and meaning of the queries. Recall-precision definitely suffered.
All actual indexing (as opposed to weighting) and retrieval was done with the Cornell SMART Version
11.0 system, using the standard SMART procedures (e.g., stopwords, stemming, inverted file retrieval).
All runs were made on a Sun Sparc 2 with 64 Mbytes of memory. All times reported are CPU time.
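As a rough illustration of those standard preprocessing steps, the following Python sketch builds a tiny inverted file with stopword removal and stemming. The stopword list and the crude suffix stripper are simplified stand-ins for SMART's real word lists and stemmer, not the actual SMART 11.0 code.

```python
# Tiny sample stopword list; SMART uses a much larger one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}

def stem(word):
    """Crude suffix stripping as a stand-in for SMART's stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each indexed term to the set of document ids containing it.

    docs: dict mapping document id -> raw text.
    """
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOPWORDS:
                continue  # drop stopwords before indexing
            index.setdefault(stem(token), set()).add(doc_id)
    return index

idx = build_inverted_file({1: "retrieval of documents", 2: "document indexing"})
# "documents" and "document" are conflated by the stemmer; "of" is dropped.
```

At retrieval time, such an inverted file lets the system touch only the documents that contain at least one query term, rather than scanning the whole collection.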