NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
The decompression process is very fast. The canonical Huffman code used is particularly
amenable to a table-based `look up and copy' implementation, and each such decoding step
generates several output bytes. As a result, the compression regime has only limited impact
on retrieval time. Moreover, the small amounts of time needed for decompression are partially
(and sometimes fully) offset by reduced disk traffic.
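To illustrate the regular structure that makes canonical Huffman codes easy to decode, here is a minimal Python sketch that assigns canonical codes from a set of code lengths and decodes a bit string. The symbol set and bit-by-bit loop are illustrative assumptions only; a production decoder would use the byte-wide `look up and copy' tables described above rather than single-bit steps.

```python
def assign_canonical_codes(lengths):
    """Given symbol -> code length, assign codewords in canonical order:
    symbols sorted by (length, symbol), codes incremented and left-shifted."""
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    out, code, prev_len = {}, 0, 0
    for s in syms:
        code <<= (lengths[s] - prev_len)   # pad to the new length
        out[s] = (code, lengths[s])
        prev_len = lengths[s]
        code += 1
    return out

def decode(bits, codes):
    """Decode a '0'/'1' string bit by bit against the canonical table.
    (Illustrative only; real decoders consume whole bytes per table lookup.)"""
    inv = {v: k for k, v in codes.items()}   # (codeword, length) -> symbol
    out, acc, n = [], 0, 0
    for b in bits:
        acc, n = (acc << 1) | int(b), n + 1
        if (acc, n) in inv:                  # prefix-free: first match is it
            out.append(inv[(acc, n)])
            acc, n = 0, 0
    return out
```

With lengths `{'a': 1, 'b': 2, 'c': 2}` this assigns `a=0`, `b=10`, `c=11`, so the bit string `01011` decodes to `a b c`.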
Because every word and non-word decompressed is retrieved from the compression model,
it is crucial that this model be held in main memory. If the compression model were stored on
disk, two or more disk accesses would be required for each word in each document retrieved.
On the smaller databases, the compression model did not exceed about 500 Kb. We were very
interested to see how much space would be required to hold the model for the TREC collection.
2.4 Inverted file compression
The inverted file index entries should also be compressed, since without compression the in-
verted file might take as much space as the original text itself [2, 9, 16]. Good compression
of index entries can be achieved by numbering documents sequentially from 1, sorting index
entries, representing each sequence of identifiers as a sequence of differences or gaps, and then
using compact representations of the generally small integers that result.
For example, a word might appear in the first, fourth, seventeenth, ..., documents, with
an index entry of
1, 4, 17, 91, 113, ...
This can be represented by the sequence of differences
1, 3, 13, 74, 22, ...
which can then be compressed.
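The transformation between document numbers and gaps is a pair of trivial inverse operations; a short Python sketch (illustration only, not the authors' code):

```python
def to_gaps(doc_ids):
    """Convert a sorted list of document numbers into first differences."""
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]

def from_gaps(gaps):
    """Recover document numbers by keeping a running sum of the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

For instance, `to_gaps([1, 4, 17, 91, 113])` yields `[1, 3, 13, 74, 22]`, the difference sequence shown above, and `from_gaps` inverts it exactly.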
We have experimented with several different techniques for compressing these runlengths,
in two broad classes [8, 9]. Global methods use the same encoding for all entries, and so have
the advantage of being more general, but are insensitive to the frequency of each term. Local
methods code each entry using a code that is adjusted to take into account the frequency of
the term.
Use of a variable length prefix code allows the frequent small runlengths to be represented
more succinctly than the less frequent long runlengths, and is a marked improvement over the
standard binary encoding. The further advantage of using a local code arises because of the
high variability of word frequency, and the effect this has on the average runlength within the
entry. For example, the word `the'1 can be expected to have a sequence of runlengths where
each difference is very small, usually one. On the other hand, a rare word such as `palimpsest'
will have gaps that are often much larger, but also sometimes small, since rare words will tend
to occur in clusters of related documents within the collection. Hence, a local compression
method that adjusts its codes according to word frequency can be expected to perform better
than a global method, and this is indeed the case [9]. Term `frequency-within-document' values
are efficiently represented by use of the γ code [16].
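One widely used variable-length prefix code of this kind is the Elias γ code, which represents a positive integer as a unary count of its binary length followed by the binary digits themselves, so that small values get short codes. A minimal sketch in Python (an illustration of the standard γ code, not the authors' implementation):

```python
def gamma_encode(n):
    """Elias gamma: (len(bin)-1) zeros, then the binary form of n (n >= 1)."""
    assert n >= 1
    b = bin(n)[2:]                     # binary digits, leading 1 included
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":          # z zeros announce z+1 value bits
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))
        i += z + 1
    return out
```

Thus the gap 1 costs a single bit, 3 is coded as `011`, and the gap sequence `1, 3, 13, 74, 22` from the example above round-trips through a concatenation of γ codes.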
After compression, the index entries for a text database will typically occupy less than 10%
of the space required for the text itself. Decompression is fast, with about 100,000 document
identifiers decoded per second on a 25 MIP Sun SPARC 2. None of these compression methods
require any significant memory resources, once the index entry has been retrieved.
1. We did not `stop' any words, and also indexed all numbers. We did, however, stem all words before indexing,
using Lovins's algorithm [6].