SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
chapter
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
for the first round of experimentation [6, 9].
The times listed cover all activity from issue
of query until a ranked list of 200 document
numbers is calculated.
0
Co
E
i:
0
10*o -
continue
L=1OO,OOO
L=1O,OOO
L=1 000
*1
1000 10000 100000 1000000
Number of accumulators
Figure 7: Time required by skipping
2.4 Retrieval experiments
The implementation techniques described above,
skipping and limitation of the number of accu-
mulators, can be applied to document retrieval
as well as to page retrieval. Table 1 summarises
the various resources required and performance
achieved by both document and page retrieval,
for various values of L, the nominal number of ac-
cumulators. The following points should be noted
in connection with the data in Table 1
File sizes Stemming was applied during in-
verted file construction, but no words were
stopped. All alphanumeric strings were
indexed, totalling 333,000,000 term occur-
rences of 538,244 distinct terms. The doc-
ument index contained 136,000,000 (d,fd,t)
pointers, the paged index 196,000,000 such
pairs. Each is compressed to occupy less
than one byte. The variation in inverted file
size is due to the insertion of skips, or, in the
case of the "All" row, their absence.
The text itself was also stored compressed
using a word-based model [9], allowing the
2,055 Mb to be reduced to 605 Mb. A fur-
ther 27 Mb of smaller auxiliary files were
also required. In total, the complete retrieval
system for the paged TREC occupied about
817 Mb, or 40% of the initial unindexed text.
Retrieval time The SPARCstation 512 used
for these experiments is approximately 2.2
times faster than the SPARCstation 2 used
186
Fetching of the 200 answers, which occupy
226.9 Kb compressed, adds a uniform 1.9 sec
per query including the cost of decompres-
sion. Decompressed, there 15 an average of
829.9 Kb of output text per query.
Elapsed times during the ranking process
are generally about 1.6 sec greater than cpn
time, in the course of which an average of
42 disk accesses are made to the lexicon; 42
disk accesses made to the inverted file itself
(fetching a total of 1.65 Mb of data, contain-
mg about two million compressed (d, fd,t)
pairs); and 275 accesses to the combined file
of document weights and addresses. All of
these files except the inverted file are small,
and likely to have been buffered into main
memory, hence the small overhead.
Elapsed time for the presentation of docu-
ments was about 5.0 sec per query greater
than cpu time, caused by the need to per-
form a further 200 seeks into the file con-
taming the compressed text. This file is too
large for there to have been any buffering ef-
fect.
Memory space When skipping is employed
(the first three rows in each section of the ta-
ble) memory space during query processing
is proportional to the number of non-zero ac-
cumulators. In these experiments a hash ta-
ble was used, and an average of 14 bytes per
accumulator required. For the "All" exper-
iments an array of accumulators was used,
2.8 Mb for document retrieval, and 6.6 Mb
for page retrieval.
An array of 6-bit approximate document
lengths was used to guide the retrieval pr[OCRerr]
cess [11], requiring 0.5 Mb for document re-
trieval and 1.2 Mb for page retrieval.
Terms processed The values for "Terms pro
cessed" indicate the average number of terms
processed before processing switched from
the first to the second phase of continue. For
each query a whole number of inverted file
entries were processed, and this is why the
"non-zero accumulators" average is greater
than the target value L. Not surprisingly,
with page retrieval fewer terms were re-