SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Retrieval of Partial Documents chapter A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology D. K. Harman collection. We would expect that, in a large database of such documents, the implementation techniques we developed for unstructured text could be used for structured documents almost without change. 2 Retrieval of paged text The first issue that must be considered when ac- cessing a set of documents as a collection of pages is the pagination strategy; one possible method for pagination is discussed in Section 2.1. Then, given paged text, answers must be found in response to documents. In an inverted file text database system of the kind we have been devel- oping [6, 9], the implications of paging are a large increase both in the number of accumulators used to hold the intermediate cosine values and in the length of the inverted file entries. The former means that evaluating a ranking will require more memory; while the latter implies a longer time to resolve queries. Methods for avoiding these in- creases are outlined in Sections 2.2 and 2.3; a full description is given elsewhere [10]. Finally, there is the impact on effectiveness of the decision to paginate the documents. Ex- perimental results comparing document retrieval with page retrieval are described in Section 2.4. 2.1 Breaking text into pages One way to store text in a database is as a set of records, each containing a single document. Such an organisation implies that entire docu- ments must be retrieved in response to queries. Moreover, a query can match long documents in which its words are widely separated and could well be unrelated, so this storage strategy can re- sult in irrelevant material being retrieved. An alternative way of presenting text is as a series of pages. There are several advantages to using pages: * they are of a more manageable size than whole documents, and the size variation be- tween pages can be constrained to be far less than the size variation between documents; * if a user looks for answers matching a query, it is likely that in relevant material the query's words will occur close together in the retrieved text; * retrieving a page of text to display may be considerably cheaper than retrieving an en- tire document; 1. Set prev undefined and curr Wi. 2. For each j from 2 to n, (a) If curr > B, emit prev if it is defined, set prev curr, and set curr Wj. (b) Otherwise, if W, > prev then set prev prev + curr and set curr w3. (c) Otherwise, set curr curr + w3. 3. If curr < B then set prev prev + curr and emit prev. Otherwise, emit prev followed by curr. Figure 1: Paging algorithm * in many applications it is natural to regard documents as consisting of parts rather than as a whole; for example, the nodes in a hy- pertext system or the sections of a book; and * only parts of documents can, in general, fit on a screen, and people can only comprehend part of a document at a time. For these reasons-and because of the challenge posed by the sheer size of a paged version of the TREC collection-experiments were undertaken with paged text. The paging strategy adopted first breaks doc- uments into minimal units likely to be useful for ranking. In most collections this unit would prob- ably be the paragraph. The algorithm shown in Figure 1 is then used to gather paragraphs into pages, where W[OCRerr] is the length of the i'th para- graph of the original document and B is the tar- get length of each constructed page. Documents of fewer than B bytes must, of course, be allo- cated singleton pages, but all other pages are B bytes or more. Moreover, the aim is that only a small number of pages are significantly longer than B bytes. The algorithm requires time linear in the length of the text and a constant amount of space, and so is a small additional step during database construction. For the experiments de- scribed below, length was measured in bytes (the alternatives being words, or some more abstract length value based upon term frequencies), and B = 1,000 was used. The pagination of the 2,055 Mb TREC collec- tion resulted in the 742,358 documents being con- verted into 1,743,848 pages of average size 1,235 bytes each. 182