NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
National Institute of Standards and Technology
D. K. Harman

                                  Documents
Experiment      5      10      15      20      25      30      50     200
    1         0.286   0.271   0.248   0.236   0.234   0.229   0.206   0.102
    2         0.243   0.221   0.214   0.204   0.191   0.178   0.170   0.092
    3         0.271   0.250   0.229   0.221   0.206   0.202   0.184   0.094
    4         0.329   0.257    [?]    0.229   0.206   0.202   0.184   0.085
    5         0.343   0.264   0.238   0.236   0.234   0.231   0.209   0.099

Table 2: Comparison of ranking formula for fixed number of documents returned

                                  Documents
Experiment      5      10      15      20      25      30      50     200
   [?]        0.186   0.164   0.181   0.161   0.160   0.152   0.140   0.120
   [?]        0.100   0.121   0.105   0.100   0.094   0.088   0.090   0.083
   [?]        0.171   0.121   0.114   0.100   0.091   0.083   0.100   0.082
   [?]        0.214   0.171   0.186   0.189   0.189   0.193   0.179    NA

Table 3: Comparison of ranking formula for fixed number of sections returned

([?] marks an entry that is illegible in the source.)

document ranking was perhaps surprising; other studies have indicated that considering smaller fragments was helpful, although in combination with larger contexts. We wondered whether the section boundaries that had been imposed were inappropriate. We thus applied the pagination techniques described in Section 2.1 to the 4,000 document database used here. These results are shown in Table 4 as Experiment 10.

These three experiments are a little difficult to interpret. Dividing documents into sections leads to poorer retrieval performance, but dividing further into pages leads to comparable retrieval performance to ranking whole documents. It may be that the manually supplied divisions are poorer than the divisions generated by automatic techniques [5]. Experiments 2-4 show that some of this performance degradation can be ameliorated by taking documents' structure into account.
However, these experiments indicate that there is no retrieval advantage in breaking the document up, should the desired unit of retrieval be whole documents.

4 Conclusions

The combination of a restricted-accumulators policy and the introduction of skips to the compressed inverted file entries allows fast query evaluation on large text collections. Moreover, the ranking can be carried out within modest amounts of main memory. For example, the paged TREC collection contains 1.7 million pages, but ranked queries of 50 or more terms can be resolved within seconds using just a few megabytes of main memory. These two techniques mean that large collections can be searched on small machines without measurable degradation in retrieval effectiveness.

In the second part of the experiment we have concentrated on large documents, breaking them into smaller units for the purposes of indexing. It is not clear that users are interested in retrieving 3 Mb documents, and these experiments were designed to allow users the option of retrieving smaller parts of such documents. The results were mixed. It appears that indexing both sections and documents is helpful in ranking sections. However, it is not clear what an appropriate indexing strategy is if only full documents are to be returned. We are continuing our investigation of partial document retrieval.

Acknowledgements

We would like to thank Daniel Lam and Neil Sharman for their assistance with various components of the experiments described here. This work was supported by the Australian Research Council, the Collaborative Information Technology Research Institute, and the Centre for Intelligent Decision Systems.
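Illustrative sketch

The restricted-accumulators policy mentioned in the Conclusions can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names (build_index, ranked_query), the max_accumulators parameter, and the simplified TF-IDF weighting are all hypothetical, and the index is an uncompressed in-memory dictionary with no skips. It assumes a policy in which terms are processed rarest first and, once the accumulator limit is reached, existing accumulators are still updated but no new ones are created.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def ranked_query(index, num_docs, query_terms, max_accumulators, top_k=10):
    """Rank documents while holding at most max_accumulators partial scores.

    Assumption: simplified TF-IDF scoring stands in for the similarity
    measure of the paper, which is not given in this section.
    """
    accumulators = {}
    # Process rare (high-weight) terms first so they claim the accumulators.
    terms = [t for t in set(query_terms) if t in index]
    terms.sort(key=lambda t: (len(index[t]), t))
    for term in terms:
        postings = index[term]
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings:
            weight = (1 + math.log(tf)) * idf
            if doc_id in accumulators:
                accumulators[doc_id] += weight
            elif len(accumulators) < max_accumulators:
                accumulators[doc_id] = weight
            # Otherwise the limit is reached: no new accumulator is created.
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

In this sketch, memory use during ranking is bounded by max_accumulators rather than by the number of documents in the collection. Once the limit is reached, only documents that already hold an accumulator matter, which is the situation in which skips in an inverted file entry would allow most of the entry to be passed over without decoding; the sketch does not model that step.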