NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
[Figure 5: CPU time (seconds, log scale) against number of accumulators (log scale, 1,000 to 1,000,000), with one curve each for the array, continue, and quit methods.]

Figure 5: CPU time of quit and continue

Whilst the results for retrieval effectiveness are encouraging, the results for time are not. Figure 5 shows the cost of these two strategies in terms of CPU time, with the set A of accumulators stored as a hash table. The times shown include all processing required to identify the top r = 200 ranked documents, but do not include the cost of actually retrieving those documents, which, in a system such as ours, in which the text is also compressed, involves further decoding effort. All timings are for a lightly loaded Sun SPARCstation Model 512 using a single processor; programs were written in C and compiled using gcc. While for values of L less than about 100,000 the continue method takes no longer than the simple cosine measure when implemented using an array of accumulators, it is clearly unsatisfactory compared with the quit technique. In the next section we discuss methods by which this situation can be improved.

2.3 Processing long inverted file entries

Compression can reduce index size by a factor of six; for example, for the paged form of TREC the size reduces from 1,120 Mb to 184 Mb, an irresistible saving. Compression, however, prohibits random access into inverted file entries, so that the whole of each inverted file entry must be decoded, even though not every (d, fd,t) pair is required. This is the reason that the continue strategy is slow.

The need to decompress an inverted file entry in full can be avoided by including a series of synchronisation points at which decoding can commence [10]. These can be arranged as a series of pointers, or skips, as illustrated in Figure 6.

[Figure 6: a compressed inverted file entry with skip pointers over blocks of document numbers.]

Figure 6: Adding skips
The skips divide the inverted file entry into a
series of blocks, and to access a number in the
entry, only the skips and the block containing
the number need to be decoded. Appropriately
coded, such skips increase the size of the com-
pressed inverted file by only a few percent, but
can drastically reduce the amount of decoding required. Decode time for a skipped inverted file entry is given by T = t_d(2S + Lp/2S), where t_d is the time needed to decode one (d, fd,t) pair, p is the number of such pairs in the inverted file entry, L is the number of accumulators, and S is the number of skips. This time is minimised at S = √(Lp)/2. For 10,000 accumulators and an inverted file entry of length 100,000 (a common figure for queries to the paged TREC), total time including reading from disk on a typical system drops from 0.30 seconds to 0.22 seconds [10]. Moreover, under the same assumptions an uncompressed inverted file entry of this length would take 0.30 seconds to read, and so the skipped compressed inverted file provides faster ranking than even an uncompressed inverted file.
To test skipping, inverted files were constructed for several different fixed values of L and then, for each inverted file, the CPU time for inverted file processing was measured for a range of actual L values. To ensure that inverted files did
not become too large, a minimum block size was
imposed, requiring that every skip cover at least
four (d, fd,t) pairs. The results of these experiments are shown in Figure 7. As can be seen, query processing is substantially faster when the number of accumulators is less than about 100,000. As predicted by the analysis, the index constructed assuming that L = 1,000 gives the best performance when the number of accumulators (the variable L in Figure 3) is small.