SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
chapter
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
collection. We would expect that, in a large
database of such documents, the implementation
techniques we developed for unstructured text
could be used for structured documents almost
without change.
2 Retrieval of paged text
The first issue that must be considered when ac-
cessing a set of documents as a collection of pages
is the pagination strategy; one possible method
for pagination is discussed in Section 2.1.
Then, given paged text, answers must be found
in response to documents. In an inverted file text
database system of the kind we have been devel-
oping [6, 9], the implications of paging are a large
increase both in the number of accumulators used
to hold the intermediate cosine values and in the
length of the inverted file entries. The former
means that evaluating a ranking will require more
memory; while the latter implies a longer time to
resolve queries. Methods for avoiding these in-
creases are outlined in Sections 2.2 and 2.3; a full
description is given elsewhere [10].
Finally, there is the impact on effectiveness
of the decision to paginate the documents. Ex-
perimental results comparing document retrieval
with page retrieval are described in Section 2.4.
2.1 Breaking text into pages
One way to store text in a database is as a set
of records, each containing a single document.
Such an organisation implies that entire docu-
ments must be retrieved in response to queries.
Moreover, a query can match long documents in
which its words are widely separated and could
well be unrelated, so this storage strategy can re-
sult in irrelevant material being retrieved.
An alternative way of presenting text is as a
series of pages. There are several advantages to
using pages:
* they are of a more manageable size than
whole documents, and the size variation be-
tween pages can be constrained to be far less
than the size variation between documents;
* if a user looks for answers matching a query,
it is likely that in relevant material the
query's words will occur close together in the
retrieved text;
* retrieving a page of text to display may be
considerably cheaper than retrieving an en-
tire document;
1. Set prev undefined and curr Wi.
2. For each j from 2 to n,
(a) If curr > B, emit prev if it is defined,
set prev curr, and set curr Wj.
(b) Otherwise, if W, > prev then set
prev prev + curr and set curr w3.
(c) Otherwise, set curr curr + w3.
3. If curr < B then set prev prev + curr and
emit prev. Otherwise, emit prev followed by
curr.
Figure 1: Paging algorithm
* in many applications it is natural to regard
documents as consisting of parts rather than
as a whole; for example, the nodes in a hy-
pertext system or the sections of a book; and
* only parts of documents can, in general, fit
on a screen, and people can only comprehend
part of a document at a time.
For these reasons-and because of the challenge
posed by the sheer size of a paged version of the
TREC collection-experiments were undertaken
with paged text.
The paging strategy adopted first breaks doc-
uments into minimal units likely to be useful for
ranking. In most collections this unit would prob-
ably be the paragraph. The algorithm shown in
Figure 1 is then used to gather paragraphs into
pages, where W[OCRerr] is the length of the i'th para-
graph of the original document and B is the tar-
get length of each constructed page. Documents
of fewer than B bytes must, of course, be allo-
cated singleton pages, but all other pages are B
bytes or more. Moreover, the aim is that only
a small number of pages are significantly longer
than B bytes. The algorithm requires time linear
in the length of the text and a constant amount
of space, and so is a small additional step during
database construction. For the experiments de-
scribed below, length was measured in bytes (the
alternatives being words, or some more abstract
length value based upon term frequencies), and
B = 1,000 was used.
The pagination of the 2,055 Mb TREC collec-
tion resulted in the 742,358 documents being con-
verted into 1,743,848 pages of average size 1,235
bytes each.
182