NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel

National Institute of Standards and Technology
Donna K. Harman

Inversion

The inversion process was performed as two passes in tandem with the two compression passes. The first stage, counting term frequencies, had time and memory requirements similar to those of the first component of the compression process, but could not actually share data structures because of differing definitions of `words' and the need, in the index, for stemming.

The second pass of the inversion process was where we expected to encounter problems. The in-memory technique requires the allocation of memory a little larger than the final size of the compressed inverted file, and our previous experiments had indicated that this would be roughly 10% of the size of the input text [7]. That is, we expected to use more than 200 Mb of the 256 Mb of memory available. However, the long TREC documents, averaging 470 words rather than the 90 of our previous largest collection, meant that the inverted file was smaller than anticipated; as it turned out, the inversion process ran smoothly to completion in 162 Mb, and contributed five and a half hours to the time of the combined second pass. One would not want to do this every day, but these resource requirements are well within the capabilities of a typical $50,000 workstation. Using the `local VG' code [8, 9], the final inverted file required 132 Mb, or 6.4% of the input text, of which about 40 Mb was occupied by the coded `frequency-within-document' values. Hence, if only Boolean queries were to be supported, the inverted file could be reduced to under 100 Mb.
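The two-pass scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it stores each inverted list as Elias gamma codes of document-number gaps (a simple stand-in for the `local VG' code used in the paper), and it grows Python lists rather than pre-allocating exact bit addresses, which a memory-efficient implementation would compute from the first-pass counts.

```python
# Sketch of two-pass in-memory inversion with gap-compressed lists.
# Tokenisation here is plain whitespace splitting; the real system
# applied its own word definition and stemming.

from collections import defaultdict

def gamma_encode(n, bits):
    """Append the Elias gamma code of n >= 1 to a list of bit characters."""
    b = bin(n)[2:]                   # binary body, e.g. 5 -> '101'
    bits.extend('0' * (len(b) - 1))  # unary prefix giving the body length
    bits.extend(b)

def invert(docs):
    # Pass 1: count document frequencies, so that a full implementation
    # could pre-allocate the exact space for each compressed list.
    df = defaultdict(int)
    for doc in docs:
        for t in set(doc.split()):
            df[t] += 1

    # Pass 2: for each term, append the gamma-coded gap between
    # successive document numbers containing that term.
    bits = {t: [] for t in df}
    last = {t: 0 for t in df}
    for docno, doc in enumerate(docs, start=1):
        for t in set(doc.split()):
            gamma_encode(docno - last[t], bits[t])
            last[t] = docno
    return {t: ''.join(b) for t, b in bits.items()}
```

Coding gaps rather than raw document numbers is what makes the index small: frequent terms have many small gaps, and gamma codes represent small integers in few bits.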
The second pass of the inversion process also produced three other files: the index mapping for the inverted file, giving the bit address for each entry; the file of exact document lengths (later merged with the document mapping to form the document information file); and a file of approximate document lengths. The sizes of these, plus all of the other components of the compressed database, are shown in Table 1.

Performance on ranked queries

The 50 test queries were transformed, by removal of SGML tags, stop words, and punctuation, into lists of words to which the cosine measure could be applied. The top 200 ranked documents were then retrieved and decompressed for each query.

Operation       File                    Disk accesses   Data (Kb)   CPU time (sec)   Elapsed (sec)
Searching       Index mapping                      43         170
Ranking         Inverted index                     43        1708
                Document information              219           7             16.6            18.2
Decompression   Compressed text                   200         203
                Answers                                       729              3.7             7.2
Total                                             505        2817             20.3            25.4

Table 2: Ranked queries on TREC (b = 8)

Processing of these queries was remarkably quick. Table 2 shows, for the three stages of processing each query, the average amount of data involved and the average time required. (We did not separate the time taken by the searching algorithm and the ranking algorithm.) As can be seen, on average the queries were processed in about 20 seconds of CPU time. Given that each query involved approximately 2.1 million document numbers and 600,000 non-zero accumulators, we were very pleased with this result. Note that the times listed in
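The ranking stage described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes a simple tf-idf weighting under the cosine measure, with one accumulator per candidate document and the pre-computed document lengths used for normalisation; the names `rank`, `index`, and `doc_lengths` are invented for the example.

```python
# Sketch of cosine-measure ranking with per-document accumulators.
# index maps term -> list of (docno, within-document frequency);
# doc_lengths maps docno -> document vector length.

import math
import heapq

def rank(query_terms, index, doc_lengths, r=200):
    """Return the top r (score, docno) pairs for the query."""
    N = len(doc_lengths)
    acc = {}                                  # one accumulator per document seen
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(1 + N / len(postings))  # rarer terms weigh more
        for docno, freq in postings:
            acc[docno] = acc.get(docno, 0.0) + freq * idf
    # Normalise each accumulated weight by document length, keep top r.
    scored = ((w / doc_lengths[d], d) for d, w in acc.items())
    return heapq.nlargest(r, scored)
```

The accumulator dictionary is the structure whose size the text reports: a query touching 2.1 million postings left about 600,000 documents with non-zero accumulated weight, of which only the top 200 were decompressed and returned.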