NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Retrieval of Partial Documents chapter A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology D. K. Harman 2.2 Management of accumulators An effective way to rank documents against a query is with the cosine measure, defined by the function Wq,t Wd,t Gd [OCRerr]q2,t[OCRerr][OCRerr]2,t' where q is the query, d is the document, and W[OCRerr],t is the weight of word i in document or query x. We assume that Wd = V(Z[OCRerr] wd2,t) is the length of document d, and that the top r answers are to be retrieved and returned to the user. In our experiments term weights were calcu- lated using the IDE rule Wd,t = fd,t log(N/ft), where fd,t is the number of appearances of term tin document d, N is the number of documents in the collection, and ft is the number of docu- ments that contain term t. Harman [3] gives a summary of ranking techniques and a discussion of the cosine measure and IDE weighting rule. To support the cosine measure in an inverted file text database system, each inverted file entry contains a sequence of (d, fd,t) pairs for some term i. The value ft, the number of documents con- taining t, is stored in the lexicon, and from these two the cosine contribution fd,t (log(N/ft))2 of term tin document d can be calculated. A struc- ture of accumulators is used to collect these con- tributions. As the inverted file entry for each query term is processed, the accumulator value for each document number in the inverted file entry is updated by adding in each partial co- sine value as it is calculated, so that by the time all inverted file entries have been processed the accumulator for document d contains the value fd,t (log(N/ft))2 = Wqt Wd,t. The number of accumulators required to pro- cess a query can be large. Queries were created from TREC topics 51-100 by simply extracting and stemming all of the alphanumeric strings, and stopping 601 closed-class words. The gen- erated queries had an average of over 40 terms, and each resulted in about 75% of the pages of TREC having a non-zero accumulator value. With this ratio the most memory-efficient accu- mulator structure is an array. But at 32 or 64 bits per accumulator, this structure is formidably large, and dominates the retrieval-time mem- ory requirements. Even initialising it is time- consuming. 1. Order the words in the query from highest weight to lowest. 2. Set A $; A is the current set of accumula- tors. 3. For each word tin the query, (a) (b) Retrieve It, the inverted file entry for t. For each (d, [OCRerr] pair in It, i. If the accumulator Ad E A, calcu- late Ad Ad + wq,t Wd,t. ji. Otherwise, set A A+{Ad}, cal- culate Ad Wq,t Wd,t. (c) If Al >[OCRerr] L, go to step 4. 4. For each document d such that Ad C A, cal- culate Cd Ad/Wd. 5. Identify the r highest values of Cd using a heap. Figure 2: Quit algorithm for computing cosine using approximately L accumulators A simple strategy for restricting the number of accumulators is to order query terms by decreas- ing weight, and only process terms until some designated termination condition is met. One such condition is to impose an a priori bound L on the number of non-zero accumulators, and to stop processing terms when this bound is reached. Such a ranking process, which we call quit, is il- lustrated in Figure 2. Other possible termination conditions would be to place a limit on the num- ber of terms considered or on the total number of pointers decoded; or to place an upper bound on the term frequency ft, and only process terms that appear in fewer than x% of the documents, for some predetermined value x. Buckley and Le- wit [1]; Lucarella [8]; and Wong and Lee [13] have also considered various stopping rul[OCRerr]s, and derive conditions under which inverted file entries can be discarded without affecting the r top documents (although their order within the ranking may be different). Here we are prepared to allow approx- imate as well as exact rules, acknowledging that the cosine measure is itself a heuristic. Quitting has the advantage of only process- ing the inverted file entries of a subset of the query terms, and hence of faster ranking, but at the possible expense of poor retrieval perfor- mance, depending upon how discriminating the low weighted terms are. An alternative is to continue the processing of 183