NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
techniques [1, 13, 14]. The cosine measure evaluates the relevance of a document to a query
via the function

cosine(q, d) = \frac{\sum_t w_{q,t} \cdot w_{d,t}}{\sqrt{\sum_t w_{q,t}^2} \cdot \sqrt{\sum_t w_{d,t}^2}}
where q is the query, d is the document, and w_{x,t} is the weight of word t in document or
query x. A common function for assigning weights to words in document or query x is to use
the frequency-modified inverse document frequency, described by

w_{x,t} = tf_{x,t} \cdot \log_e(N/f_t)

where tf_{x,t} is the number of occurrences of word t in x, N is the number of documents in
the collection, and f_t is the number of documents containing t. This is commonly called the
tf-idf weighting.
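As an illustrative sketch (not the authors' code), the weight can be computed directly from the term and collection statistics defined above:

```python
import math

def tf_idf_weight(tf_xt, N, f_t):
    """w_{x,t} = tf_{x,t} * log_e(N / f_t).

    tf_xt: number of occurrences of term t in document or query x
    N:     number of documents in the collection
    f_t:   number of documents containing t
    """
    return tf_xt * math.log(N / f_t)

# A term occurring twice in x, appearing in 10 of 1000 documents:
w = tf_idf_weight(2, 1000, 10)
```

Note that a term appearing in every document (f_t = N) receives weight zero, so ubiquitous terms contribute nothing to the ranking.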
The information required to compute the sum \sum_t w_{q,t} \cdot w_{d,t} is held in the inverted index.
To answer the query, an accumulator with initial value zero is created for each document in
the database; the index entry for each word in the query is retrieved; and, for each `word-
number, document-number, frequency-in-document' triple, a cosine contribution is added into
the corresponding accumulator. This technique is also used by Harman and Candela [4].
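The accumulator loop can be sketched as follows. The data-structure layout and names are illustrative assumptions: the inverted index is taken to be a mapping from each term to its list of (document-number, in-document frequency) postings, so that f_t is simply the length of the postings list.

```python
import math
from collections import defaultdict

def rank(query_terms, index, N, doc_lengths):
    """Accumulator-based cosine ranking (a sketch).

    query_terms: mapping term -> frequency of the term in the query
    index:       mapping term -> list of (doc_id, tf_dt) postings
    N:           number of documents in the collection
    doc_lengths: mapping doc_id -> precomputed document length
    """
    acc = defaultdict(float)              # one accumulator per touched document
    for t, tf_qt in query_terms.items():
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(N / len(postings)) # f_t = number of documents containing t
        w_qt = tf_qt * idf
        for doc_id, tf_dt in postings:
            acc[doc_id] += w_qt * (tf_dt * idf)
    # Normalise each accumulator by the document length to obtain the cosine,
    # and return documents in decreasing order of similarity.
    return sorted(((acc[d] / doc_lengths[d], d) for d in acc), reverse=True)
```

Only documents containing at least one query term ever acquire an accumulator, which is why the fraction of documents touched (discussed below) matters.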
For each document in a static database the document length, \sqrt{\sum_t w_{d,t}^2}, is a constant, and
can be pre-computed and stored at the time the database is created. Thus, to compute the final
ranking after the accumulators have been calculated, the document length for each document
with a non-zero accumulator value must be fetched and the accumulator value divided by this
length.
For the queries supplied with TREC, over 90% of the documents contained one or more
of the query terms, meaning that over 90% of the document lengths needed to be accessed.
Moreover, at over 2 Mb, the table of document lengths is large, and in practice space limitations
might prevent it from being held in memory. We have shown, however, that keeping limited-precision
approximate lengths in memory can largely eliminate the cost of fetching document lengths from
disk [17]. Suppose we are prepared to allow b bits of main memory storage for each document
length x, where L \le x < U. Then if B = ((U + \epsilon)/L)^{2^{-b}}, where \epsilon is a very small number, the
function f(x) = \lfloor \log_B(x/L) \rfloor yields integers c between 0 and 2^b - 1 for all legal values of x,
and g(c) = L \cdot B^c can be used to compute approximations to x.
The c values stored can then be used in either of two ways. First, the approximate lengths
can be used to determine which exact document lengths should be retrieved from disk. Second,
cosine can be computed with g(c), yielding an approximate ranking. With as few as two bits
per length, there was only a small decline in retrieval performance in the five small document
collections for which we had relevance judgments [17]. The effects of these techniques on TREC
are discussed in Section 2.5.
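Under the definitions above, the b-bit length codec can be sketched as follows (an illustration of the formulas, not the authors' implementation; the choice of \epsilon is an assumption):

```python
import math

def make_codec(L, U, b, eps=1e-9):
    """Quantise document lengths x, with L <= x < U, into b-bit codes.

    B = ((U + eps) / L) ** (2 ** -b), so that
    f(x) = floor(log_B(x / L)) lies in 0 .. 2**b - 1, and
    g(c) = L * B**c recovers an approximation to x from its code c.
    """
    B = ((U + eps) / L) ** (2.0 ** -b)

    def f(x):  # encode: exact length -> b-bit integer code
        return int(math.floor(math.log(x / L, B)))

    def g(c):  # decode: code -> approximate length
        return L * B ** c

    return f, g
```

Because f takes the floor, g(f(x)) never exceeds x; consecutive codes cover geometrically growing intervals of lengths, which is what lets a handful of bits span a wide range [L, U).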
2.3 Text compression
Using a word-based model of the text, the space required to store the documents comprising a
database can be reduced to less than 30% of the original size [2, 8, 15]. Each word occurrence
in the text is replaced by a canonical Huffman code, the length of which is dependent on the
frequency of the word, and the intervening `non-words' are similarly coded against a vocabulary
of non-words. Thus, by alternately decoding a word, then a non-word, and so on down the
compressed text, the exact form of the input document can be reconstructed. For further
details of the scheme used the reader is referred to our previous work [15].
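The alternating word/non-word parse underlying the scheme can be illustrated as follows (a sketch only; the definition of a "word" as a maximal alphanumeric run is an assumption, and the real scheme then codes each vocabulary with canonical Huffman codes [15]):

```python
import re

def parse_alternating(text):
    """Parse text into (word, non-word) pairs so that decoding a word,
    then a non-word, and so on reproduces the input exactly.  A word is
    a maximal alphanumeric run; a non-word is a maximal run of other
    characters.  Either member of a pair may be empty at the boundaries."""
    pairs = re.findall(r"([A-Za-z0-9]*)([^A-Za-z0-9]*)", text)
    if pairs and pairs[-1] == ("", ""):  # drop the zero-width final match
        pairs.pop()
    return pairs

def reconstruct(pairs):
    """Inverse of parse_alternating: interleave words and non-words."""
    return "".join(w + n for w, n in pairs)
```

The round trip reconstruct(parse_alternating(s)) == s is the property that lets the compressed text be decoded back to the exact original form.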