SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
associated information. The result of applying these techniques to the TREC data is described
in Section 2.5.
2.1 Text databases
Text databases provide access to text documents on the basis of their content. We assume that
access is via the inverted file indexing scheme we have described elsewhere [16]. In this scheme,
each distinct word in the database is held in a vocabulary, which may be an array, or may be
a search structure such as a B-tree. For each word in the vocabulary the inverted index stores
an index entry, a list of the identifiers of documents containing the word. Interleaved with the
identifiers are the number of occurrences of each word in each document. Since the index
entries contain ordinal document identifiers rather than disk addresses, a document mapping
is needed to convert identifiers into addresses. This will require either a non-trivial amount
of memory space, or must be stored on disk in an auxiliary file. Similarly, an index mapping
is required to turn an ordinal word number into the address of the corresponding index entry.
Query [[OCRerr]bularY
Inverted
--------- Index
L Search __ Entries
Algorithm
mApproximate
Lengths
[OCRerr]: Ranking
Algorithm
F;Hidexm
Mapping Compression
L;-iodel Compressed
Text
Answers [OCRerr]Compression:
Algorithm
Figure 1: Organisation of text databases
~ment
LL-jrmation
Other components of the database are the compress[OCRerr]on model, used for text compression,
and the document lengths, used for ranking. These are discussed later. In our experiments, the
vocabulary and compression model were held in memory, together with an array of approximate
lengths, also described below; all of the other structures were stored on disk. Figure 1 shows
the relationship between these various components.
2.2 Query evaluation
Given an inverted file index of the format described, it is straightforward to rank the documents
in a database with respect to an informally phrased query, returning, say, the top ten documents
as answers. As an example, consider the cosine measure, one of the most effective ranking
230