SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text chapter A. Kent A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology Donna K. Harman Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text Alan Kent Alistair Moffat Ron Sacks-Davis Ross Wilkinson Justin Zobel Collaborative Information Technology Research Institute Departments of Computer Science RMIT and The University of Melbourne 723 Swanston St, Carlton, Melbourne 3053 Australia {ajk, alistair, rsd, ross, jz}@kbs.citri.edu.au 1 Introduction Large document collections present many problems for practical information retrieval. To avoid unnecessary accesses to the text of the collection during query evaluation, comprehensive in- dexes are required. It must be possible to create and access these indexes in a reasonable amount of time, and to store them, and the data itself, in a reasonable amount of space. More- over, although advanced indexes permit efficient evaluation of conventional ranking measures, they make only limited use of any structure inherent in the query. Here we describe two separate streams of research that address these problems. In the first stream we have over the last two years developed indexing and compression techniques that allow the text to be stored compressed [2, 8]; provide fast access to document collections via compressed indexes [2, 9, 16]; create indexes with a fast inversion technique [7]; and permit fast ranking of large document collections [17]. On the small document collections initially available to us (the largest was 132 Mb), compressed text and index typically required less than 35% of the space required for the source text; query response times were a few seconds; and peak memory requirements were less than a Megabyte during query processing. In Section 2 we describe the application of these techniques to the TREC data. In the second stream of research we used the compression and indexing techniques discussed above, as well as signature file indexing techniques developed previously by the group [5, 11], to investigate the extent to which the structure of queries aids the retrieval process. In Section 3 we report on a series of experiments that examine how this structure may be used in both generation of Boolean queries and in ranking. As part of these experiments we are able to test the advantage of using adjacent pairs in retrieval. 2 Compression and Inverted File Indexing We first descn be the structure of text databases with inverted file indexes, and how documents are ranked with regard to a query. Then, in Section 2.3, we describe the compression of the text of the documents, and in Section 2.4 describe the representation of the inverted file and 229