NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel

National Institute of Standards and Technology
Donna K. Harman

techniques [1, 13, 14]. The cosine measure evaluates the relevance of a document to a query via the function

\[ \mathrm{cosine}(q,d) = \frac{\sum_t w_{q,t}\, w_{d,t}}{\sqrt{\sum_t w_{q,t}^2}\; \sqrt{\sum_t w_{d,t}^2}} \]

where q is the query, d is the document, and w_{x,t} is the weight of word t in document or query x. A common function for assigning weights to words in document or query x is the frequency-modified inverse document frequency, described by

\[ w_{x,t} = tf_{x,t} \cdot \log_e(N/f_t) \]

where tf_{x,t} is the number of occurrences of word t in x, N is the number of documents in the collection, and f_t is the number of documents containing t. This is commonly called the tf·idf weighting.

The information required to compute the sum \sum_t w_{q,t} w_{d,t} is held in the inverted index. To answer the query, an accumulator with initial value zero is created for each document in the database; the index entry for each word in the query is retrieved; and, for each `word-number, document-number, frequency-in-document' triple, a cosine contribution is added into the corresponding accumulator. This technique is also used by Harman and Candela [4].

For each document in a static database the document length, \sqrt{\sum_t w_{d,t}^2}, is a constant, and can be pre-computed and stored at the time the database is created. Thus, to compute the final ranking after the accumulators have been calculated, the document length for each document with a non-zero accumulator value must be fetched, and the accumulator value divided by this length. For the queries supplied with TREC, over 90% of the documents contained one or more of the query terms, meaning that over 90% of the document lengths needed to be accessed.
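The accumulator technique can be illustrated with a short sketch (this is an illustration, not the authors' implementation; the names `rank`, `index`, and `doc_length` are assumptions, and the constant query-length factor of the cosine is omitted since it does not affect the ranking):

```python
import math
from collections import Counter, defaultdict

def rank(query_terms, index, doc_length, N):
    """Accumulator-based cosine ranking.

    `index` maps each word to a postings list of
    (document-number, frequency-in-document) pairs;
    `doc_length` holds the precomputed lengths sqrt(sum_t w_{d,t}^2);
    N is the number of documents in the collection.
    """
    acc = defaultdict(float)                 # one accumulator per touched document
    for t, tf_qt in Counter(query_terms).items():
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(N / len(postings))    # log_e(N / f_t)
        w_qt = tf_qt * idf                   # tf.idf weight of t in the query
        for d, tf_dt in postings:
            acc[d] += w_qt * (tf_dt * idf)   # add the cosine contribution
    # normalise each non-zero accumulator by its stored document length
    return sorted(((acc[d] / doc_length[d], d) for d in acc), reverse=True)
```

Only documents touched by some query term acquire an accumulator, but as the text notes, for the TREC queries that was over 90% of the collection.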
Moreover, at over 2 Mb, the table of document lengths is large, and in practice space limitations might prevent it being held in memory. We have shown, however, that keeping limited-precision approximate lengths in memory can largely eliminate the cost of fetching document lengths from disk [17]. Suppose we are prepared to allow b bits of main memory storage for each document length x, where L \le x \le U. Then if B = ((U + \epsilon)/L)^{2^{-b}}, where \epsilon is a very small number, the function

\[ f(x) = \lfloor \log_B(x/L) \rfloor \]

yields integers c between 0 and 2^b - 1 for all legal values of x, and

\[ g(c) = L \cdot B^c \]

can be used to compute approximations to x. The stored c values can then be used in either of two ways. First, the approximate lengths can be used to determine which exact document lengths should be retrieved from disk. Second, the cosine can be computed with g(c), yielding an approximate ranking. With as few as two bits per length, there was only a small decline in retrieval performance in the five small document collections for which we had relevance judgments [17]. The effects of these techniques on TREC are discussed in Section 2.5.

2.3 Text compression

Using a word-based model of the text, the space required to store the documents comprising a database can be reduced to less than 30% of the original size [2, 8, 15]. Each word occurrence in the text is replaced by a canonical Huffman code, the length of which is dependent on the frequency of the word, and the intervening `non-words' are similarly coded against a vocabulary of non-words. Thus, by alternately decoding a word, then a non-word, and so on down the compressed text, the exact form of the input document can be reconstructed. For further details of the scheme used the reader is referred to our previous work [15].
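The limited-precision document-length encoding described earlier can be sketched as follows (a sketch under the stated definitions of f, g, and B, not the implementation of [17]; the name `make_codec` and the particular value of `eps` are assumptions):

```python
import math

def make_codec(L, U, b, eps=1e-9):
    """Limited-precision length codes: b bits per document length x,
    with L <= x <= U.

    B = ((U + eps)/L) ** (2 ** -b), so f(x) = floor(log_B(x / L)) maps
    every legal x to an integer c in 0 .. 2**b - 1, and g(c) = L * B**c
    recovers an approximation to x (always g(f(x)) <= x < g(f(x) + 1)).
    """
    B = ((U + eps) / L) ** (2.0 ** -b)

    def f(x):
        return int(math.floor(math.log(x / L, B)))

    def g(c):
        return L * B ** c

    return f, g
```

Because the buckets are geometric, the relative error of g(f(x)) is bounded by the ratio B, which shrinks as more bits b are allowed per length.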
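The word-based compression model can be illustrated with a small sketch (an illustration only: it builds ordinary Huffman codes rather than the canonical Huffman codes of [15], and the helper names `compress` and `decompress` are assumptions; the alternation of word and non-word tokens matches the scheme described in the text):

```python
import heapq
import itertools
import re
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code (codewords as bit strings) for a frequency
    table. Any prefix code with Huffman codeword lengths decodes the
    same way; the paper's canonical variant fixes the assignment order."""
    if len(freqs) == 1:                       # degenerate one-symbol lexicon
        return {next(iter(freqs)): "0"}
    tiebreak = itertools.count()
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def compress(text):
    """Code each word against the word lexicon and each intervening
    non-word against the non-word lexicon, alternately."""
    words = re.findall(r"\w+", text)
    nonwords = re.split(r"\w+", text)         # len(words) + 1 pieces
    wcode = huffman_code(Counter(words))
    ncode = huffman_code(Counter(nonwords))
    bits = ncode[nonwords[0]]
    for w, n in zip(words, nonwords[1:]):
        bits += wcode[w] + ncode[n]
    return bits, wcode, ncode

def decompress(bits, wcode, ncode):
    """Alternately decode a non-word, then a word, and so on, exactly
    reconstructing the input document."""
    wdec = {c: s for s, c in wcode.items()}
    ndec = {c: s for s, c in ncode.items()}
    out, cur, table = [], "", ndec            # the stream starts with a non-word
    for bit in bits:
        cur += bit
        if cur in table:                      # prefix-free, so first match is right
            out.append(table[cur])
            cur = ""
            table = wdec if table is ndec else ndec
    return "".join(out)
```

Frequent words receive short codewords, which is where the compression comes from; the vocabularies themselves must of course also be stored alongside the compressed text.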