SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
Compression, Fast Indexing, and Structured Queries
on a Gigabyte of Text
Alan Kent
Alistair Moffat
Ron Sacks-Davis
Ross Wilkinson
Justin Zobel
Collaborative Information Technology Research Institute
Departments of Computer Science
RMIT and The University of Melbourne
723 Swanston St, Carlton, Melbourne 3053
Australia
{ajk, alistair, rsd, ross, jz}@kbs.citri.edu.au
1 Introduction
Large document collections present many problems for practical information retrieval. To avoid
unnecessary accesses to the text of the collection during query evaluation, comprehensive in-
dexes are required. It must be possible to create and access these indexes in a reasonable
amount of time, and to store them, and the data itself, in a reasonable amount of space. More-
over, although advanced indexes permit efficient evaluation of conventional ranking measures,
they make only limited use of any structure inherent in the query.
Here we describe two separate streams of research that address these problems. In the first
stream we have over the last two years developed indexing and compression techniques that
allow the text to be stored compressed [2, 8]; provide fast access to document collections via
compressed indexes [2, 9, 16]; create indexes with a fast inversion technique [7]; and permit
fast ranking of large document collections [17]. On the small document collections initially
available to us (the largest was 132 Mb), compressed text and index typically required less than
35% of the space required for the source text; query response times were a few seconds; and
peak memory requirements were less than a Megabyte during query processing. In Section 2
we describe the application of these techniques to the TREC data.
In the second stream of research we used the compression and indexing techniques discussed
above, as well as signature file indexing techniques developed previously by the group [5, 11], to
investigate the extent to which the structure of queries aids the retrieval process. In Section 3
we report on a series of experiments that examine how this structure may be used in both
generation of Boolean queries and in ranking. As part of these experiments we are able to test
the advantage of using adjacent pairs in retrieval.
2 Compression and Inverted File Indexing
We first descn be the structure of text databases with inverted file indexes, and how documents
are ranked with regard to a query. Then, in Section 2.3, we describe the compression of the
text of the documents, and in Section 2.4 describe the representation of the inverted file and
229