SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
Alistair Moffat* Ron Sacks-Davis†
Information systems usually retrieve whole doc-
uments as answers to queries. However, it may
in some circumstances be more appropriate to re-
trieve parts of documents. These parts could be
formed by arbitrary division of running text into
pieces of similar length, or by considering the doc-
ument's hierarchical structure. Here we consider
how to break documents into parts, how to imple-
ment retrieval of parts, and the impact of division
of documents on retrieval effectiveness.
1 Introduction
Provision of answers to informally phrased questions
is a central part of information retrieval.
These answers traditionally take the form of doc-
uments retrieved from a text database, but doc-
uments will often be unsatisfactory as answers.
They may be large and unwieldy; the answer they
represent may be diffuse, and therefore hard for
the user to extract; and word-based retrieval sys-
tems may be misled by the breadth of vocabu-
lary of a long document into believing it to be
relevant.
Indexing and returning parts of documents ad-
dresses these problems. We have approached the
problem of partial documents in two ways. The
first approach is to regard documents as an un-
structured series of "pages" of text of similar
length, each of which can be returned as an an-
swer to a query. We would expect, under this
approach, that any bias in the retrieval mecha-
nism towards documents of a particular length
should be eliminated. By treating the document
from which an answer page is drawn as the answer,
paging can be used even in contexts where
documents are required as answers, as is the case
for the TREC experiments.
Ross Wilkinson‡ Justin Zobel§
* Dept. of Computer Science, The University of Melbourne,
Parkville, Victoria, Australia 3052; alistair@cs.mu.oz.au
† Collaborative Information Technology Research Institute,
723 Swanston St., Carlton, Victoria, Australia 3052;
rsd@kbs.citri.edu.au
‡ Dept. of Computer Science, RMIT, GPO Box 2476V,
Melbourne, Victoria, Australia 3001; ross@cs.rmit.oz.au
§ Dept. of Computer Science, RMIT, GPO Box 2476V,
Melbourne, Victoria, Australia 3001; jz@cs.rmit.oz.au
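
The paging approach described above can be illustrated with a minimal sketch. The text does not specify how page boundaries are chosen or how page scores are mapped back to documents, so both choices here are assumptions: pages are formed by dividing a document's words as evenly as possible around a hypothetical target length, and a document inherits the score of its best-ranked page. The function names are illustrative only.

```python
def split_into_pages(words, target_len=100):
    """Split a list of words into consecutive pages of similar length.

    Assumption: the number of pages is the document length divided by
    target_len, rounded, so no page is much shorter than the others.
    """
    if not words:
        return []
    n_pages = max(1, round(len(words) / target_len))
    base, extra = divmod(len(words), n_pages)
    pages, start = [], 0
    for i in range(n_pages):
        # The first `extra` pages take one additional word each.
        size = base + (1 if i < extra else 0)
        pages.append(words[start:start + size])
        start += size
    return pages


def pages_to_documents(ranked_pages):
    """Collapse (doc_id, page_no, score) triples into a document ranking.

    Assumption: a document is scored by its highest-scoring page.
    """
    best = {}
    for doc_id, _page_no, score in ranked_pages:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions a very long document contributes many candidate pages but appears only once in the final document ranking.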
Breaking documents into pages, however, has
implications for implementation: the growth in
the number of candidate answers is such that
current approaches for evaluating queries have
unacceptable memory requirements and response
times. We have developed new algorithms for
implementing information retrieval methods on
large collections, concentrating on the cosine
measure with IDF term weights as a typical example.
These include techniques for efficiently
constructing and compressing large inverted files,
and for restricting the amount of memory space
and processing time required during query eval-
uation. The result is that we are able to iden-
tify answers in a fraction of the space and time
required by previous methods. Using these tech-
niques on a version of the TREC data in which
the documents are broken into over 1.7 million
pages, answers can be found more quickly than
they could previously be found on the unpaged
data, even though the latter has a smaller index
and fewer records. Section 2 gives an overview of
these techniques.
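
To make the cost of query evaluation concrete, the following is a minimal sketch of cosine ranking with IDF term weights over an inverted file. It is not the authors' algorithm: the weighting formula (`log(1 + N / f_t)`), the uncompressed in-memory index, and all function names are assumptions chosen for illustration. The sketch keeps one accumulator per candidate answer, which shows why memory use grows when documents are broken into many pages; the space- and time-limiting techniques mentioned above are not shown.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build an inverted index from a dict of doc_id -> list of terms.

    Returns (index, norms, n), where index maps term -> {doc_id: tf},
    norms maps doc_id -> vector length under tf * idf weights, and
    n is the number of documents.
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    n = len(docs)
    sq = defaultdict(float)
    for postings in index.values():
        idf = math.log(1 + n / len(postings))  # assumed IDF variant
        for doc_id, tf in postings.items():
            sq[doc_id] += (tf * idf) ** 2
    norms = {d: math.sqrt(v) for d, v in sq.items()}
    return index, norms, n


def rank(query_terms, index, norms, n, k=10):
    """Return the top-k (doc_id, cosine score) pairs for the query."""
    acc = defaultdict(float)  # one accumulator per candidate answer
    q_sq = 0.0
    for t in set(query_terms):
        postings = index.get(t)
        if not postings:
            continue
        idf = math.log(1 + n / len(postings))
        q_sq += idf ** 2  # query term frequency assumed to be 1
        for doc_id, tf in postings.items():
            acc[doc_id] += idf * (tf * idf)
    q_norm = math.sqrt(q_sq)
    scores = {d: s / (norms[d] * q_norm) for d, s in acc.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

In this sketch every document (or page) containing any query term receives an accumulator, so a paged collection with many more candidate answers needs correspondingly more accumulators unless their number is restricted.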
Our second approach to the problem of par-
tial documents was to regard documents as hi-
erarchical structures. Some of the documents in
TREC are very large. It is not clear that it is
desirable to return such documents as a whole,
nor is it clear that these documents should be in-
dexed as a whole. Most of the long documents
in TREC are from the Federal Register collection
and all have some degree of structure associated
with them. We conducted a set of experiments
that attempted to determine whether these doc-
uments should be indexed as single objects, and
whether the documents' structure could be used
in conjunction with the contents of their elements.
We also experimented with retrieval of partial
documents and investigated whether context,
that is, the rank of the whole document, helped
improve the ranking of sections. The experiments
with hierarchical structures are necessarily based
on the small set of longer documents in the TREC