NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
National Institute of Standards and Technology, D. K. Harman

                 File size (Mb)    Retrieval time (cpu sec)   Non-zero accumulators
                 Page      Doc.    Page      Doc.             Page         Doc.
L = 1,000        145.8     205.4   2.36      2.95             2,330        2,701
L = 10,000       156.7     220.3   4.46      5.57             12,855       13,238
L = 100,000      161.6     230.2   9.80      12.00            107,737      109,370
All (array)      128.6     184.4   7.66      12.48            582,783      1,307,115

(a) Resources required

                 Terms processed   Precision at 200                  Pessimal 11pt average
                 Page      Doc.    Page            Doc.              Page      Doc.
L = 1,000        3.08      2.84    0.271-0.612     0.260-0.639       0.164     0.166
L = 10,000       6.80      6.06    0.346-0.530     0.319-0.554       0.185     0.173
L = 100,000      16.84     14.26   0.333-0.446     0.326-0.504       0.175     0.175
All (array)      42.44     42.44   0.331-0.444     0.321-0.502       0.174     0.173

(b) Retrieval effectiveness

Table 1: Document vs. paragraph retrieval: (a) resources required, and (b) retrieval effectiveness

Terms processed    The number of terms required, on average, to consume the available accumulators.

Precision at 200   The range of values in this column shows the fraction of unjudged documents that were accessed: the lower number assumes that all unjudged documents are irrelevant; the upper is calculated assuming that all are relevant.

11pt effectiveness The "11pt average" values are calculated based solely upon the top 200 ranked documents, assuming that all remaining relevant documents are ranked last, and that all unjudged documents are not relevant.

Index construction

New algorithms have been developed for index construction.
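The construction algorithms themselves are not detailed in this excerpt. As a generic illustration only (not the authors' method), the following sketch builds a small inverted file in which each postings list is stored as Elias-gamma-coded document gaps, a standard compression choice for inverted files:

```python
# Generic sketch of inverted-index construction with compressed
# postings lists. Document numbers are converted to d-gaps, and each
# gap is Elias-gamma coded. This is an illustration only; the paper's
# memory-efficient construction algorithms are not reproduced here.

from collections import defaultdict

def gamma_encode(n):
    """Elias-gamma code for an integer n >= 1, as a bit string."""
    b = bin(n)[2:]                   # binary representation of n
    return '0' * (len(b) - 1) + b    # unary length prefix, then binary

def build_index(docs):
    """docs: list of token lists; returns term -> gamma-coded d-gap string."""
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs, start=1):
        for term in sorted(set(tokens)):   # one posting per (term, doc)
            postings[term].append(doc_id)
    index = {}
    for term, ids in postings.items():
        # first entry is the document number itself; the rest are gaps
        gaps = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        index[term] = ''.join(gamma_encode(g) for g in gaps)
    return index

docs = [["text", "retrieval"], ["partial", "text"], ["retrieval"]]
index = build_index(docs)
```

Here "retrieval" occurs in documents 1 and 3, giving gaps 1 and 2 and the coded list "1010"; smaller gaps yield shorter codes, which is why frequent terms compress well.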
Using these algorithms, the index was built in under 4 hours, using a peak of 40 Mb of main memory and less than 50 Mb of temporary disk space over and above the final size of the inverted file. Compression of the documents took a further 4 hours, for a total database build time of under 8 hours.

Based upon these experiments we conclude that:

* use of a limited number of accumulators does not appear to impact retrieval effectiveness, and so is an extremely attractive heuristic for ranking large collections because of the dramatic savings in retrieval-time memory usage that result;

* introduction of skips into the compressed inverted file entries significantly reduces processing time in this restricted-accumulators environment;

* index compression can, in this way, become "free": if only partial decoding is required, the input time saved by compression can be more than enough to pay the cpu cost of fractional decoding;

* all non-stopped terms should be allowed to contribute to the ranking, and the quit strategy is inferior to continue; and

* even relatively simple pagination gives retrieval not measurably inferior to document retrieval.

When dealing with pages, the retrieval system was implemented to display the entire document, but to highlight the page or pages that had caused the document to be ranked highly. This decision made the paged collection particularly easy to use, and when we browsed the answers to the queries it was very satisfying to be able to find the exact text that had triggered the match. This advantage of pagination became clear even in the early stages of the project, while we were attempting to debug the implementation of the cosine method.

On the other hand, a problem that was brought out by these experiments was the difficulty of comparing retrieval effectiveness in