NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Effective and Efficient Retrieval from Large and Dynamic Document Collections
D. Knaus, P. Schauble
D. K. Harman (editor), National Institute of Standards and Technology

[Figure 4: box plots of response times, one per partition (AP1, AP2, DOE1.1, DOE1.2, DOE1.3, FR1, FR2, WSJ1, WSJ2, ZIFF1, ZIFF2); horizontal axis 0 to 10 sec.]
Figure 4: Response times of the first ranked document per run (adhoc queries versus the listed partitions) for the fastest method (M8 nltc.ntc.dfiS).

weighting because of the scaling with the midf(·). On the other hand, the approximation error for the "ntc" weighting is very large compared to both the "ltc" and the "lnc" weighting because of the normalized linear weighting instead of the normalized logarithmic weighting.

The box plots [7, p. 336] presented in Figure 4 show the distribution of the response times required to determine the top ranked document. A box plot visualizes the median (the line within the box), half of all samples (the box), and outliers (the dots) of a sample collection. In our case we have 50 samples: the response times of the 50 adhoc queries run against a partition. For most queries the top ranked document was retrieved in less than two seconds. The few outliers were all produced by the same queries (#103, #136, #138, #144). In general, the response times become shorter if a partition contains fewer documents and if the document descriptions consist of fewer postings on average (see Figure 5).

5 Conclusions

In our approach we stressed the update efficiency. We have shown that the retrieval effectiveness does not have to be sacrificed to achieve a high update efficiency when coping with highly dynamic document collections.
Our approach could probably be further improved by finding a weighting scheme that, on the one hand, achieves a very good retrieval effectiveness and that, on the other hand, can be approximated by frequency independent weights with only little variation from the exact weights. The retrieval efficiency could be improved by better partitioning the document collections according to the lengths of the documents. Our approach seems to be very amenable to parallel processing. We may think of several configurations (partitioning the query, partitioning the document collection, etc.). At this moment, it is not clear which configuration is appropriate for which requirements. Furthermore, we do not know yet how dynamic the document collection must be such that our access structure outperforms inverted files.

[Figure 5: horizontal axis 0 to 10 (*10^6 postings).]
Figure 5: Number of postings per partition.

References

[1] C. Buckley, G. Salton, and J. Allan. Automatic Retrieval With Locality Information Using SMART. In TREC-1 Proceedings, pages 59-72, 1992.
[2] W. B. Croft and P. Savino. Implementing Ranking Strategies Using Text Signatures. ACM Transactions on Information Systems, 6(1):42-62, 1988.
[3] U. Glavitsch and P. Schauble. A System for Retrieving Speech Documents. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR Conference on R&D in Information Retrieval, pages 168-176, 1992.
[4] D. Harman. Relevance Feedback Revisited. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, ed-