SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
We have also used these inverted file encodings in the inversion process required during
database construction, again capitalising on the large main memory offered by current work-
stations and the ability to perform two passes over a static database [7]. An alternative
approach is the disk-based method described by Harman and Candela [4]; they report that the
inversion of an 806 Mb database required 313 hours and 163 Mb of disk work space to produce
an inverted index of 112 Mb. We were thus very interested to see whether the in-memory
approach would be successful for the TREC collection.
2.5 Experimental results
This section describes the performance of our techniques on the 2055 Mb TREC collection. All
results are for a 25 MIP Sun SPARC 2 with 256 Mb of main memory and no other active user
processes.
Compression
The two components of the compression process took about three hours each, generating a
compression model and a compressed database of 4.2 Mb and 605.1 Mb respectively (Table 1).
The second pass also generated a file containing the document mapping, a little under 3 Mb.
That is, including both of these auxiliary files, the document collection was reduced from
2055 Mb to 612 Mb, or 29.8%, a disk saving of 1400 Mb.
The need to store the coding model meant that the compression passes were expensive
on memory. With more than 900,000 unique symbols, and an additional storage requirement
of about 12 bytes per word for the search structure, each pass required about 25 Mb of data
structures. These figures reinforce our contention that database creation needs to be performed
on a reasonably well configured machine, even if query processing is performed elsewhere.
%I (Mb, peak)
[OCRerr] Pass [OCRerr] Output files I CPU Time [OCRerr][OCRerr]ry
First pass: 1
Compression Compression model 4.2 0.2 2:37 25.6
Inversion Vocabulary 6.4 0.3 3:02 18.7
Overhead [OCRerr] 10.6 0.5 0:19 2.5
Total ____________ 5:58 46.8
Second pass:
Compression Compressed text 605.1 29.4 3:27 25.6
Document ma[OCRerr]ping 2.8 0.1
Inversion Inverted index 132.2 6.4 5:25 162.1
Inverted index mapping 2.1 0.1
Document lengths 2.8 0.1
Approximate lengths 0.7 0.0
Overhead _______________________ ____________ 0:23 2.5
Total 745.8 36.3 9:15 190.2
[OCRerr] Overall Total [ [OCRerr] 756.4 36.8 15:13 [OCRerr] 190.2 ]
Table 1: Files in compressed TREC database (b = 8). Percentages are relative to the size of
the original document collection (2055.3 Mb)
233