NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, Donna K. Harman (Ed.)

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel

The decompression process is very fast. The canonical Huffman code used is particularly amenable to a table-based `look up and copy' implementation, and each such decoding step generates several output bytes. As a result, the compression regime has only limited impact on retrieval time. Moreover, the small amounts of time needed for decompression are partially (and sometimes fully) offset by reduced disk traffic.

Because every word and non-word decompressed is retrieved from the compression model, it is crucial that this model be held in main memory. If the compression model were stored on disk, two or more disk accesses would be required for each word in each document retrieved. On the smaller databases, the compression model did not exceed about 500 Kb. We were very interested to see how much space would be required to hold the model for the TREC collection.

2.4 Inverted file compression

The inverted file index entries should also be compressed, since without compression the inverted file might take as much space as the original text itself [2, 9, 16]. Good compression of index entries can be achieved by numbering documents sequentially from 1, sorting index entries, representing each sequence of identifiers as a sequence of differences or gaps, and then using compact representations of the generally small integers that result. For example, a word might appear in the first, fourth, seventeenth, ..., documents, with an index entry of 1, 4, 17, 91, 113, ... This can be represented by the sequence of differences 1, 3, 13, 74, 22, ..., which can then be compressed. We have experimented with several different techniques for compressing these runlengths, in two broad classes [8, 9].
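The gap transformation described above is a simple prefix-difference computation. A minimal illustrative sketch (not the authors' code) might look like this:

```python
# Illustrative sketch of the gap (difference) representation described in
# the text: an ascending list of document identifiers becomes a list of
# generally small integers, which compress well.

def to_gaps(doc_ids):
    """Convert an ascending list of document numbers into gaps."""
    gaps = []
    prev = 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_gaps(gaps):
    """Invert the transformation, recovering the document numbers."""
    doc_ids = []
    total = 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

# The example from the text: a word appearing in documents 1, 4, 17, ...
print(to_gaps([1, 4, 17, 91, 113]))   # [1, 3, 13, 74, 22]
```

Because the transformation is reversible, the index entry can be reconstructed exactly during query evaluation by summing the decoded gaps.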
Global methods use the same encoding for all entries, and so have the advantage of being more general, but are insensitive to the frequency of each term. Local methods code each entry using a code that is adjusted to take into account the frequency of the term. Use of a variable-length prefix code allows the frequent small runlengths to be represented more succinctly than the less frequent long runlengths, and is a marked improvement over the standard binary encoding.

The further advantage of using a local code arises because of the high variability of word frequency, and the effect this has on the average runlength within the entry. For example, the word `the'¹ can be expected to have a sequence of runlengths in which each difference is very small, usually one. On the other hand, a rare word such as `palimpsest' will have gaps that are often much larger, but also sometimes small, since rare words tend to occur in clusters of related documents within the collection. Hence, a local compression method that adjusts its codes according to word frequency can be expected to perform better than a global method, and this is indeed the case [9]. Term `frequency-within-document' values are efficiently represented by use of the γ code [16].

After compression, the index entries for a text database will typically occupy less than 10% of the space required for the text itself. Decompression is fast, with about 100,000 document identifiers decoded per second on a 25 MIPS Sun SPARC 2. None of these compression methods requires any significant memory resources once the index entry has been retrieved.

¹We did not `stop' any words, and also indexed all numbers. We did, however, stem all words before indexing, using Lovins's algorithm [6].
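As an illustration of the kind of variable-length prefix code discussed above, the Elias γ code gives short codewords to small integers and longer codewords to large ones. The sketch below is assumption-laden: it is a textbook bit-string rendering of the γ code, not the authors' implementation, and real systems would pack the bits rather than use character strings.

```python
# Minimal sketch of the Elias gamma code: an integer x >= 1 is written as
# floor(log2 x) zero bits (a unary length prefix) followed by the binary
# representation of x. Small gaps thus get short codewords.

def gamma_encode(x):
    """Encode a positive integer as a gamma codeword ('0'/'1' string)."""
    assert x >= 1
    binary = bin(x)[2:]                # binary form, leading '1' included
    return '0' * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into integers."""
    values, i, n = [], 0, len(bits)
    while i < n:
        zeros = 0
        while bits[i] == '0':          # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

# Encoding the gap sequence from the text and decoding it back.
gaps = [1, 3, 13, 74, 22]
encoded = ''.join(gamma_encode(g) for g in gaps)
assert gamma_decode(encoded) == gaps
```

Note how the frequent small gaps of a common word cost very few bits (a gap of 1 is the single bit `1`), while the occasional large gap of a rare word pays proportionally more, which is exactly the behaviour the text argues for.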