NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
National Institute of Standards and Technology
D. K. Harman

Retrieval of Partial Documents

Alistair Moffat*  Ron Sacks-Davis†  Ross Wilkinson‡  Justin Zobel§

*Dept. of Computer Science, The University of Melbourne, Parkville, Victoria, Australia 3052; alistair@cs.mu.oz.au
†Collaborative Information Technology Research Institute, 723 Swanston St., Carlton, Victoria, Australia 3052; rsd@kbs.citri.edu.au
‡Dept. of Computer Science, RMIT, GPO Box 2476V, Melbourne, Victoria, Australia 3001; ross@cs.rmit.oz.au
§Dept. of Computer Science, RMIT, GPO Box 2476V, Melbourne, Victoria, Australia 3001; jz@cs.rmit.oz.au

Information systems usually retrieve whole documents as answers to queries. However, in some circumstances it may be more appropriate to retrieve parts of documents. These parts could be formed by arbitrary division of running text into pieces of similar length, or by considering the document's hierarchical structure. Here we consider how to break documents into parts, how to implement retrieval of parts, and the impact of division of documents on retrieval effectiveness.

1 Introduction

Provision of answers to informally phrased questions is a central part of information retrieval. These answers traditionally take the form of documents retrieved from a text database, but documents will often be unsatisfactory as answers. They may be large and unwieldy; the answer they represent may be diffuse, and therefore hard for the user to extract; and word-based retrieval systems may be misled by the breadth of vocabulary of a long document into believing it to be relevant.

Indexing and returning parts of documents addresses these problems. We have approached the problem of partial documents in two ways. The first approach is to regard documents as an unstructured series of "pages" of text of similar length, each of which can be returned as an answer to a query. We would expect, under this approach, that any bias in the retrieval mechanism towards documents of a particular length should be eliminated. By regarding an answer to be the document from which an answer page is drawn, paging can be used even in contexts where documents are required as answers, as is the case for the TREC experiments.

Breaking documents into pages, however, has implications for implementation: the growth in the number of candidate answers is such that current approaches for evaluating queries have unacceptable memory requirements and response times. We have developed new algorithms for implementing information retrieval methods on large collections, concentrating on the cosine measure with IDF term weights as a typical example. These include techniques for efficiently constructing and compressing large inverted files, and for restricting the amount of memory space and processing time required during query evaluation. The result is that we are able to identify answers in a fraction of the space and time required by previous methods. Using these techniques on a version of the TREC data in which the documents are broken into over 1.7 million pages, answers can be found more quickly than they could previously be found on the unpaged data, even though the latter has a smaller index and fewer records. Section 2 gives an overview of these techniques.

Our second approach to the problem of partial documents was to regard documents as hierarchical structures. Some of the documents in TREC are very large. It is not clear that it is desirable to return such documents as a whole, nor is it clear that these documents should be indexed as a whole.
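Before turning to the structured documents, the first approach (paging, with the parent document returned as the answer) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the implementation used in the experiments: the page size of 100 words, the names `split_into_pages` and `rank_documents`, the particular logarithmic TF and IDF formulas, and the rule of scoring a document by its best page are all hypothetical choices made for the example. The sketch also scans every page exhaustively and so ignores the inverted-file compression and memory-limiting techniques that Section 2 describes.

```python
import math
from collections import Counter


def split_into_pages(text, page_words=100):
    """Split running text into consecutive pages of at most page_words words.

    page_words=100 is an assumed value chosen for illustration only.
    """
    words = text.lower().split()
    return [words[i:i + page_words] for i in range(0, len(words), page_words)]


def rank_documents(docs, query, page_words=100):
    """Rank documents by the cosine score of their best-matching page.

    docs maps a document identifier to its text. The TF and IDF formulas
    below are one common variant, assumed here rather than taken from the paper.
    """
    # Every page is a candidate answer, tagged with its parent document.
    pages = [(doc_id, page)
             for doc_id, text in docs.items()
             for page in split_into_pages(text, page_words)]
    n_pages = len(pages)

    # Frequencies are counted over pages, because pages are the indexed unit.
    page_freq = Counter(term for _, page in pages for term in set(page))
    idf = {term: math.log(1 + n_pages / f) for term, f in page_freq.items()}

    def weights(terms):
        counts = Counter(t for t in terms if t in idf)
        return {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}

    def norm(vector):
        return math.sqrt(sum(w * w for w in vector.values()))

    query_vec = weights(query.lower().split())
    query_norm = norm(query_vec)

    # A document's score is that of its best page (an assumed combination rule).
    best = {doc_id: 0.0 for doc_id in docs}
    for doc_id, page in pages:
        page_vec = weights(page)
        dot = sum(w * page_vec[t] for t, w in query_vec.items() if t in page_vec)
        score = dot / (query_norm * norm(page_vec)) if dot else 0.0
        best[doc_id] = max(best[doc_id], score)

    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Because each page is normalised by its own length, a long document whose relevant material is confined to one page competes on the same footing as a short document, which is the length-bias effect the paging approach is expected to remove.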
Most of the long documents in TREC are from the Federal Register collection, and all have some degree of structure associated with them. We conducted a set of experiments that attempted to determine whether these documents should be indexed as single objects, and whether a document's structure could be used in conjunction with the contents of its elements. We also experimented with retrieval of partial documents and investigated whether context, that is, the rank of the whole document, helped improve the ranking of sections.

The experiments with hierarchical structures are necessarily based on the small set of longer documents in the TREC