NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

Retrieval of Partial Documents
A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
              Index file      Retrieval time      Non-zero
                 (Mb)           (cpu sec)         accumulators
              Doc.    Page    Doc.    Page        Doc.       Page
L = 1,000     145.8   205.4    2.36    2.95         2,330        2,701
L = 10,000    156.7   220.3    4.46    5.57        12,855       13,238
L = 100,000   161.6   230.2    9.80   12.00       107,737      109,370
All (array)   128.6   184.4    7.66   12.48       582,783    1,307,115

(a) Resources required

              Terms processed    Precision at 200              Pessimal 11pt average
              Doc.     Page      Doc.           Page           Doc.     Page
L = 1,000      3.08     2.84     0.271-0.612    0.260-0.639    0.164    0.166
L = 10,000     6.80     6.06     0.346-0.530    0.319-0.554    0.185    0.173
L = 100,000   16.84    14.26     0.333-0.446    0.326-0.504    0.175    0.175
All (array)   42.44    42.44     0.331-0.444    0.321-0.502    0.174    0.173

(b) Retrieval effectiveness
Table 1: Document vs. paragraph retrieval: (a) resources required, and (b) retrieval effectiveness
quired, on average, to consume the available
accumulators.
Precision at 200 The range of values in this
column reflects the fraction of retrieved
documents that were unjudged. The lower
bound assumes that all unjudged docu-
ments are irrelevant; the upper is calculated
assuming that all are relevant.
11pt effectiveness The "11pt average" values
are calculated based solely upon the top 200
ranked documents, assuming that all remain-
ing relevant documents are ranked last, and
that all unjudged documents are not rele-
vant.
Index construction New algorithms have been
developed for index construction. Using
these algorithms, the index was built in un-
der 4 hours, using a peak of 40 Mb of main
memory and less than 50 Mb of temporary
disk space above and beyond the final size
of the inverted file. The compression of the
documents took a further 4 hours, for a total
database build time of under 8 hours.
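The two precision figures described above can be computed directly from the relevance judgements. The following is our own minimal sketch (the function name and inputs are ours, not the paper's), assuming the judgement of each of the top-ranked documents is available as relevant, not relevant, or unjudged:

```python
# Sketch of the pessimistic/optimistic precision bounds used in Table 1:
# unjudged documents are counted first as irrelevant (lower bound),
# then as relevant (upper bound).

def precision_bounds(judgements, cutoff=200):
    """judgements: list of 'R' (relevant), 'N' (not relevant), or
    'U' (unjudged) for the top `cutoff` ranked documents."""
    top = judgements[:cutoff]
    relevant = top.count('R')
    unjudged = top.count('U')
    lower = relevant / cutoff                # unjudged assumed irrelevant
    upper = (relevant + unjudged) / cutoff   # unjudged assumed relevant
    return lower, upper

# e.g. 50 relevant, 120 irrelevant, 30 unjudged in the top 200:
lo, hi = precision_bounds(['R'] * 50 + ['N'] * 120 + ['U'] * 30)
# lo = 0.25, hi = 0.40
```

The width of the interval thus measures how much of the ranking was unjudged, which is why the ranges in Table 1(b) narrow as more of the retrieved documents fall within the judged pool.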
Based upon these experiments we conclude that:
* use of a limited number of accumulators does
not appear to impact retrieval effectiveness,
and so is an extremely attractive heuristic for
ranking large collections because of the dra-
matic savings in retrieval-time memory us-
age that result;
* introduction of skips to the compressed in-
verted file entries significantly reduces pro-
cessing time in this restricted-accumulators
environment;
* index compression can, in this way, become
"free": if only partial decoding is required,
the input time saved by compression can be
more than enough to pay the cpu cost of frac-
tional decoding;
* all non-stopped terms should be allowed to
contribute to the ranking, and the quit
strategy is inferior to continue; and
* even relatively simple pagination gives re-
trieval not measurably inferior to document
retrieval.
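The continue strategy favoured above can be sketched as follows. This is our own illustration with hypothetical names, not the system's code; it assumes query terms are presented from rarest to commonest, and it omits the compressed-index and skipping machinery:

```python
# Sketch of ranking with a bounded accumulator set and the "continue"
# strategy: once roughly L accumulators exist, no new accumulators are
# created, but existing ones are still updated by the remaining terms.

def rank_continue(query_terms, postings, L, topk=10):
    """query_terms: list of (term, query_weight), rarest term first.
    postings: dict mapping term -> list of (doc_id, doc_weight)."""
    acc = {}          # document accumulators
    frozen = False    # set once the accumulator limit is reached
    for term, qw in query_terms:
        for doc, dw in postings[term]:
            if doc in acc:
                acc[doc] += qw * dw      # always update existing accumulators
            elif not frozen:
                acc[doc] = qw * dw       # create new ones only before the limit
        if len(acc) >= L:
            frozen = True                # "continue": update-only from here on
    return sorted(acc.items(), key=lambda x: -x[1])[:topk]
```

Once `frozen` is set, only postings for documents already holding accumulators matter, which is exactly the situation in which skips allow most of a compressed inverted list to go undecoded.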
When dealing with pages, the retrieval system
was implemented to display the entire document,
but highlight the page or pages that had caused
the document to be ranked highly. This deci-
sion made the paged collection particularly easy
to use, and when we browsed the answers to
the queries it was very satisfying to be able to
find the exact text that had triggered the match.
This advantage of pagination became clear even
in the early stages of the project, while we were
attempting to debug the implementation of the
cosine method.
On the other hand, a problem that was
brought out by these experiments was the dif-
ficulty of comparing retrieval effectiveness in