NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Retrieval of Partial Documents chapter A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology D. K. Harman for the first round of experimentation [6, 9]. The times listed cover all activity from issue of query until a ranked list of 200 document numbers is calculated. 0 Co E i: 0 10*o - continue L=1OO,OOO L=1O,OOO L=1 000 *1 1000 10000 100000 1000000 Number of accumulators Figure 7: Time required by skipping 2.4 Retrieval experiments The implementation techniques described above, skipping and limitation of the number of accu- mulators, can be applied to document retrieval as well as to page retrieval. Table 1 summarises the various resources required and performance achieved by both document and page retrieval, for various values of L, the nominal number of ac- cumulators. The following points should be noted in connection with the data in Table 1 File sizes Stemming was applied during in- verted file construction, but no words were stopped. All alphanumeric strings were indexed, totalling 333,000,000 term occur- rences of 538,244 distinct terms. The doc- ument index contained 136,000,000 (d,fd,t) pointers, the paged index 196,000,000 such pairs. Each is compressed to occupy less than one byte. The variation in inverted file size is due to the insertion of skips, or, in the case of the "All" row, their absence. The text itself was also stored compressed using a word-based model [9], allowing the 2,055 Mb to be reduced to 605 Mb. A fur- ther 27 Mb of smaller auxiliary files were also required. In total, the complete retrieval system for the paged TREC occupied about 817 Mb, or 40% of the initial unindexed text. Retrieval time The SPARCstation 512 used for these experiments is approximately 2.2 times faster than the SPARCstation 2 used 186 Fetching of the 200 answers, which occupy 226.9 Kb compressed, adds a uniform 1.9 sec per query including the cost of decompres- sion. Decompressed, there 15 an average of 829.9 Kb of output text per query. Elapsed times during the ranking process are generally about 1.6 sec greater than cpn time, in the course of which an average of 42 disk accesses are made to the lexicon; 42 disk accesses made to the inverted file itself (fetching a total of 1.65 Mb of data, contain- mg about two million compressed (d, fd,t) pairs); and 275 accesses to the combined file of document weights and addresses. All of these files except the inverted file are small, and likely to have been buffered into main memory, hence the small overhead. Elapsed time for the presentation of docu- ments was about 5.0 sec per query greater than cpu time, caused by the need to per- form a further 200 seeks into the file con- taming the compressed text. This file is too large for there to have been any buffering ef- fect. Memory space When skipping is employed (the first three rows in each section of the ta- ble) memory space during query processing is proportional to the number of non-zero ac- cumulators. In these experiments a hash ta- ble was used, and an average of 14 bytes per accumulator required. For the "All" exper- iments an array of accumulators was used, 2.8 Mb for document retrieval, and 6.6 Mb for page retrieval. An array of 6-bit approximate document lengths was used to guide the retrieval pr[OCRerr] cess [11], requiring 0.5 Mb for document re- trieval and 1.2 Mb for page retrieval. Terms processed The values for "Terms pro cessed" indicate the average number of terms processed before processing switched from the first to the second phase of continue. For each query a whole number of inverted file entries were processed, and this is why the "non-zero accumulators" average is greater than the target value L. Not surprisingly, with page retrieval fewer terms were re-