NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
Inversion
The inversion process was performed as two passes in tandem with the two compression passes.
The first stage, counting term frequencies, had similar time and memory requirements to the
first component of the compression process, but could not actually share data structures because
of differing definitions of `words' and the need, in the index, for stemming.
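A counting pass of this kind can be sketched as follows. The tokeniser and stemmer below are illustrative placeholders, not the definitions of `words' or the stemming algorithm actually used by the system.

```python
from collections import Counter

def count_term_frequencies(documents, stem):
    """First inversion pass: count, for each stemmed term, how many
    documents it appears in and how often it occurs in total.  These
    counts fix the size of each inverted list before the second pass."""
    doc_freq = Counter()    # number of documents containing each term
    coll_freq = Counter()   # total occurrences of each term
    for text in documents:
        words = text.lower().split()          # illustrative tokeniser
        for term in {stem(w) for w in words}:
            doc_freq[term] += 1
        for w in words:
            coll_freq[stem(w)] += 1
    return doc_freq, coll_freq

# Placeholder stemmer; a real system would use e.g. Lovins or Porter stemming.
naive_stem = lambda w: w.rstrip("s")
```

Knowing the exact length of every inverted list in advance is what allows the second pass to build the compressed index in a single pre-allocated block of memory.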
The second pass of the inversion process was where we expected to encounter problems.
The in-memory technique requires the allocation of memory a little larger than the final size
of the compressed inverted file, and our previous experiments had indicated that this would be
roughly 10% of the size of the input text [7]. That is, we expected to use more than 200 Mb
of the 256 Mb memory available. However, the long TREC documents (averaging 470 words,
compared with 90 for our previous largest collection) meant that the inverted file was smaller
than anticipated, and, as it turned out, the inversion process ran smoothly to completion in
162 Mb, and contributed five and a half hours to the time of the combined second pass. One
would not want to do this every day, but these resource requirements are well within the
capabilities of a typical $50,000 workstation.
Using the `local VG' code [8, 9], the final inverted file required 132 Mb, or 6.4% of the input
text, of which about 40 Mb was accounted for by the coded `frequency-within-document' values. Hence, if only
Boolean queries were to be supported, the inverted file could be reduced to under 100 Mb.
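Parameterised gap coding of this general kind can be illustrated with a Golomb code. This is a simplified sketch in the same family as, but not identical to, the `local VG' method of [8, 9]; in particular, the parameter b is supplied by the caller here rather than chosen per inverted list as the actual system does.

```python
def golomb_encode(gap, b):
    """Encode a positive d-gap with Golomb parameter b: a unary
    quotient followed by a truncated-binary remainder."""
    q, r = (gap - 1) // b, (gap - 1) % b
    bits = "1" * q + "0"                      # unary part, terminated by 0
    if b > 1:
        k = (b - 1).bit_length()              # bits needed for a long remainder
        cutoff = (1 << k) - b                 # remainders below this use k-1 bits
        if r < cutoff:
            bits += format(r, "b").zfill(k - 1)
        else:
            bits += format(r + cutoff, "b").zfill(k)
    return bits

def golomb_decode(bits, b):
    """Decode one Golomb-coded gap; returns (gap, bits consumed)."""
    i = bits.index("0")                       # unary quotient
    q, i = i, i + 1
    r = 0
    if b > 1:
        k = (b - 1).bit_length()
        cutoff = (1 << k) - b
        r = int(bits[i:i + k - 1], 2) if k > 1 else 0
        i += k - 1
        if r >= cutoff:                       # long remainder: one extra bit
            r = r * 2 + int(bits[i]) - cutoff
            i += 1
    return q * b + r + 1, i
```

Coding document gaps rather than absolute document numbers is what makes the long, dense lists of a large collection cheap to store: frequent terms have small gaps and hence short codes.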
The second pass of the inversion process also produced three other files: the index mapping
for the inverted file, giving the bit address for each entry; the file of exact document lengths
(later merged with the document mapping to form the document information file); and a file
of approximate document lengths. The sizes of these, plus all of the other components of the
compressed database, are shown in Table 1.
Performance on ranked queries
The 50 test queries were transformed, by removal of SGML tags, stop words, and punctuation,
into lists of words to which the cosine measure could be applied. The top 200 ranked documents
were then retrieved and decompressed for each query.
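Accumulator-based ranking of this kind can be sketched as follows. The tf-idf weighting, the structure names, and the in-memory index layout are illustrative simplifications, not the exact formulation or data structures of the actual system.

```python
import heapq
import math

def rank_cosine(query_terms, inverted_index, doc_lengths, top_k=200):
    """Cosine ranking with accumulators: walk each query term's inverted
    list adding tf*idf contributions, normalise by document length, and
    keep the top_k documents with a heap."""
    n_docs = len(doc_lengths)
    accumulators = {}                         # doc id -> partial inner product
    for term in query_terms:
        postings = inverted_index.get(term)   # list of (doc id, within-doc freq)
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc, tf in postings:
            accumulators[doc] = accumulators.get(doc, 0.0) + tf * idf
    scored = ((score / doc_lengths[doc], doc)
              for doc, score in accumulators.items())
    return heapq.nlargest(top_k, scored)      # (score, doc id), best first
```

Only documents containing at least one query term acquire an accumulator, which is why the number of non-zero accumulators (around 600,000 per query here) rather than the collection size governs the memory cost of ranking.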
                                         Disk      Data   CPU time   Elapsed
  Operation       File                   accesses  (Kb)   (sec)      (sec)
  Searching       Index mapping              43     170
  and ranking     Inverted index             43    1708
                  Document information      219       7    16.6       18.2
  Decompression   Compressed text           200     203
                  Answers                     -     729     3.7        7.2
  Total                                     505    2817    20.3       25.4

Table 2: Ranked queries on TREC (b = 8)
Processing of these queries was remarkably quick. Table 2 shows, for the three stages of
processing each query, the average amount of data involved and the average time required.
(We did not separate the time taken by the searching algorithm and the ranking algorithm.)
As can be seen, on average the queries were processed in about 20 seconds of CPU time.
Given that each query involved approximately 2.1 million document numbers and 600,000
non-zero accumulators, we were very pleased with this result. Note that the times listed in