NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
[Figure 5: CPU time (seconds, log scale) against number of accumulators (log scale, 1,000 to 1,000,000), with one curve each for the array, continue, and quit methods.]

Figure 5: CPU time of quit and continue

Whilst the results for retrieval effectiveness are encouraging, the results for time are not. Figure 5 shows the cost of these two strategies in terms of CPU time, with the set A of accumulators stored as a hash table. The times shown include all processing required to identify the top r = 200 ranked documents, but do not include the cost of actually retrieving those documents, which, in a system such as ours, in which the text is also compressed, involves further decoding effort. All timings are for a lightly loaded Sun SPARCstation Model 512 using a single processor; programs were written in C and compiled using gcc. While for values of L less than about 100,000 the continue method takes no longer than the simple cosine measure when implemented using an array of accumulators, it is clearly unsatisfactory compared with the quit technique. In the next section we discuss methods by which this situation can be improved.

2.3 Processing long inverted file entries

Compression can reduce index size by a factor of six; for example, for the paged form of TREC the size reduces from 1,120 Mb to 184 Mb, an irresistible saving. Compression, however, prohibits random access into inverted file entries, so that the whole of each inverted file entry must be decoded, even though not every (d, fd,t) pair is required. This is the reason that the continue strategy is slow.

The need to decompress an inverted file entry in full can be avoided by including a series of synchronisation points at which decoding can commence [10]. These can be arranged as a series of pointers, or skips, as illustrated in Figure 6.

[Figure 6: a compressed inverted file entry with skip pointers over blocks of document numbers.]

Figure 6: Adding skips
The skips divide the inverted file entry into a
series of blocks, and to access a number in the
entry, only the skips and the block containing
the number need to be decoded. Appropriately
coded, such skips increase the size of the com-
pressed inverted file by only a few percent, but
can drastically reduce the amount of decoding required. Decode time for a skipped inverted file entry is given by T = t_d(2S + Lp/2S), where t_d is the time needed to decode one (d, fd,t) pair, p is the number of such pairs in the inverted file entry, L is the number of accumulators, and S is the number of skips. This time is minimised at S = √(Lp)/2. For 10,000 accumulators and an inverted file entry of length 100,000 (a common figure for queries to the paged TREC), total time including reading from disk on a typical system drops from 0.30 seconds to 0.22 seconds [10]. Moreover, under the same assumptions an uncompressed inverted file entry of this length would take 0.30 seconds to read, and so the skipped compressed inverted file provides faster ranking than even an uncompressed inverted file.
To test skipping, inverted files were constructed for several different fixed values of L and then, for each inverted file, the CPU time for inverted file processing was measured for a range of actual L values. To ensure that inverted files did
not become too large, a minimum block size was
imposed, requiring that every skip cover at least
four (d, fd,t) pairs. The results of these experiments are shown in Figure 7. As can be seen, query processing is substantially faster when the number of accumulators is less than about 100,000. As predicted by the analysis, the index constructed assuming that L = 1,000 gives the best performance when the number of accumulators (the variable L in Figure 3) is small.