SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
chapter
D. Knaus
P. Schauble
National Institute of Standards and Technology
D. K. Harman
[Box plots: response time in seconds (axis 0 to 10 sec.) for
each partition: AP1, AP2, DOE1.1, DOE1.2, DOE1.3, FR1, FR2,
WSJ1, WSJ2, ZIFF1, ZIFF2]
Figure 4: Response times of the first ranked document
per run (adhoc queries versus the listed partitions) for
the fastest method (M8 nltc.ntc.dfiS).
weighting because of the scaling with the midf([OCRerr]).
On the other hand, the approximation error for the
"ntc" weighting is very large compared to both the
"ltc" and the "lnc" weighting because of the normalized
linear weighting instead of the normalized logarithmic
weighting.
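The difference between the linear and the logarithmic schemes can be illustrated with a minimal sketch. It assumes the conventional SMART reading of the three-letter codes (first letter: "n" for natural term frequency or "l" for 1 + ln(tf); second letter: "t" for an idf factor ln(N/df) or "n" for none; third letter: "c" for cosine normalization), which may differ in detail from the variants used in our runs. The function name `smart_weights` and the toy frequencies are hypothetical.

```python
import math

def smart_weights(tf, df, n_docs, tf_mode, use_idf=True):
    """Cosine-normalized SMART-style weights for one document (assumed definitions).

    tf:      dict term -> within-document frequency (>= 1)
    df:      dict term -> document frequency
    tf_mode: "n" for natural (linear) tf, "l" for logarithmic tf
    use_idf: True for the "t" component, False for "n" (e.g. "lnc")
    """
    raw = {}
    for term, freq in tf.items():
        # "n": linear in tf; "l": 1 + ln(tf), which damps high frequencies.
        w = float(freq) if tf_mode == "n" else 1.0 + math.log(freq)
        if use_idf:
            w *= math.log(n_docs / df[term])
        raw[term] = w
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# Hypothetical document: one rare and one frequent term with equal df.
tf = {"retrieval": 1, "partition": 8}
df = {"retrieval": 10, "partition": 10}

ntc = smart_weights(tf, df, n_docs=1000, tf_mode="n")
ltc = smart_weights(tf, df, n_docs=1000, tf_mode="l")

# Ratio between the largest and the smallest weight in the document.
spread_ntc = ntc["partition"] / ntc["retrieval"]   # equals tf ratio, 8
spread_ltc = ltc["partition"] / ltc["retrieval"]   # equals 1 + ln(8)
print(spread_ntc, spread_ltc)
```

Under these assumptions the linear scheme lets the weights within one document differ by the full frequency ratio, whereas the logarithmic scheme compresses that ratio, so a frequency independent approximation stays closer to the exact weights.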
The box plots [7, p. 336] in Figure 4 show the
distribution of the response times required to determine
the top ranked document. A box plot visualizes the
median (the line within the box), the middle half of all
samples (the box), and outliers (the dots) of a sample
collection. In our case each partition has 50 samples:
the response times of the 50 adhoc queries run against
that partition. For most queries the top ranked document
was retrieved in less than two seconds. The few outliers
were all produced by the same queries (#103, #136, #138,
#144). In general, response times become shorter if a
partition contains fewer documents and if the document
descriptions consist of fewer postings on average (see
Figure 5).
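The quantities a box plot displays can be computed as in the following minimal sketch. It assumes the common convention that samples farther than 1.5 interquartile ranges from the box count as outliers; the convention used in [7] may differ. The function name `box_plot_summary` and the response times are hypothetical and are not measurements from our runs.

```python
import statistics

def box_plot_summary(samples, whisker=1.5):
    """Return the median, the box (first and third quartile), and outliers."""
    # Three cut points split the sorted samples into four equal parts.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1  # the box covers the middle half of the samples
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    # Assumed outlier rule: anything beyond whisker * IQR from the box.
    outliers = [x for x in samples if x < low or x > high]
    return {"median": median, "box": (q1, q3), "outliers": outliers}

# Hypothetical response times in seconds for one partition.
times = [0.8, 1.0, 1.2, 1.4, 1.6, 9.5]
summary = box_plot_summary(times)
print(summary)
```

With 50 response times per partition, one such summary per partition yields the rows of a plot like Figure 4.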
5 Conclusions
Our approach emphasizes update efficiency. We have shown
that retrieval effectiveness does not have to be
sacrificed to achieve a high update efficiency when
coping with highly dynamic document collections. Our
approach could probably be further improved by finding
a weighting scheme that, on the one hand, achieves a
very good retrieval effectiveness and that, on the other
hand, can be approximated by frequency independent
weights that deviate only slightly from the exact weights.
Figure 5: Number of postings per partition (axis 0 to
10 * 10^6 postings).
The retrieval efficiency could be improved by better
partitioning the document collections according to the
lengths of the documents. Our approach seems to be
very amenable to parallel processing. Several
configurations are conceivable (partitioning the query,
partitioning the document collection, etc.). At this
moment, it is not clear which configuration is appropriate
for which requirements. Furthermore, we do not yet know
how dynamic a document collection must be for our
access structure to outperform inverted files.
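One of the configurations mentioned in the conclusions, partitioning the document collection, could look like the following minimal sketch. It is an illustration only and does not use our access structure: each partition is a hypothetical in-memory dictionary of document weights, scores are plain sums of the weights of the query terms, and the names `search_partition` and `search_all` are made up for this example. A thread pool stands in for whatever parallel hardware would actually be used.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_partition(partition, query_terms, k):
    """Return the k best (score, doc_id) pairs of one partition.

    partition: dict doc_id -> dict term -> weight (hypothetical layout)
    """
    scored = []
    for doc_id, weights in partition.items():
        score = sum(weights.get(t, 0.0) for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def search_all(partitions, query_terms, k):
    """Search every partition independently, then merge the local rankings."""
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(
            lambda p: search_partition(p, query_terms, k), partitions))
    # The global top k is contained in the union of the local top k lists.
    return heapq.nlargest(k, (hit for hits in local for hit in hits))

# Two hypothetical partitions with toy weights.
partitions = [
    {"d1": {"a": 0.5, "b": 0.5}, "d2": {"a": 0.1}},
    {"d3": {"b": 0.9}, "d4": {"c": 1.0}},
]
ranking = search_all(partitions, ["a", "b"], k=2)
print(ranking)
```

Because every partition returns its own k best documents, the merge step only has to compare a small number of candidates, which is what makes this configuration attractive. Whether it is preferable to partitioning the query remains open, as stated above.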
References
[1] C. Buckley, G. Salton, and J. Allan. Automatic
Retrieval With Locality Information Using SMART.
In TREC-1 Proceedings, pages 59-72, 1992.
[2] W. B. Croft and P. Savino. Implementing Ranking
Strategies Using Text Signatures. ACM Transactions
on Information Systems, 6(1):42-62, 1988.
[3] U. Glavitsch and P. Schauble. A System for
Retrieving Speech Documents. In N. Belkin,
P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR
Conference on R&D in Information Retrieval,
pages 168-176, 1992.
[4] D. Harman. Relevance Feedback Revisited. In
N. Belkin, P. Ingwersen, and A. M. Pejtersen, ed-