NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Retrieval of Partial Documents chapter A. Moffat R. Sacks-Davis R. Wilkinson J. Zobel National Institute of Standards and Technology D. K. Harman 1. Order the words iu the query from highest weight to lowest. 2. Set A 3. For each word tin the query, (a) Retrieve It. (b) For each [OCRerr]d, fd,t) pair in It[OCRerr] i. If Ad E A, calculate Ad Ad + Wq,t Wd,t. ji. Otherwise, set A A + {Ad}, cal- culate Ad Wq,t Wd,t. (c) If IAI > L, go to step 4. 20 - 4. For each remaining word tin the query, (a) Retrieve It. (b) For each d such that Ad E A, if (d,fd,t[OCRerr] E I[OCRerr], calculate Ad Ad+ Wq,t Wd,t. 5. For each document d such that Ad [OCRerr] A, cal- culate Cd Ad/Wd. (0- 0 0 15 - sured as an lipt average precision, as a function of k = Al, the number of accumulators actually used. The small numbers beside the continue curve show the average number of terms pro- cessed in phase one of each query. For example, only 8.2 terms are needed to generate 27,000 ac- cumulators. The difference between quit and con- tinue is marked, and, perhaps surprisingly, even the mid to low weight terms appear to contribute to the effectiveness of the cosine rule. Ignoring them leads to significantly poorer retrieval. 6. Identify the r highest values of Cd. Figure 3: Con[OCRerr]inue algorithm for using approxi- mately L accumulators 10 continue D quit 1000 10000 100000 1000000 inverted file entries after the bound on the num- ber of accumulators is reached, but allow no new documents into the accumulator set. This con- tinue algorithm is illustrated in Figure 3. The algorithm has two distinct phases. In the first phase, accumulators are added freely, as in the qui[OCRerr] algorithm. In the second stage, existing ac- cumulator values are updated but no new accu- mulators are added. Both qui[OCRerr] and continue gen- erate the same set of approximately L candidate answers, but in a different permutation, so when the top r documents are extracted from this set and returned, different retrieval effectiveness can be expected, particularly if r 俠. A similar strategy to that of continue is de- scribed by flarman and Candela [3, 4], although their motivation is somewhat different-they de- sire a small number of accumulators to reduce the sorting time required for a total ranking of the collection, whereas we assume that only a small fraction of the collection is to be presented. In this latter case, to find the top r documents a heap is the appropriate data structure, and O(N + r log N) time is required, a small fraction of query processing time. Figure 4 shows retrieval effectiveness, mea- Number of accumulators Figure 4: Effectiveness of quit and continue In these experiments direct assessment of the relevance of pages was not possible because the relevance judgements were for whole documents. To measure effectiveness, it was assumed that if a document was relevant then each page from it was relevant, but only the most highly ranked page from each document was returned and counted. We believe this method to be fair, since one way in which pages might be used would be to rank on pages but return whole documents, with perhaps some highlighting of the top-ranked pages. Other strategies for using parts of documents to select whole documents are discussed in Section 3. Also surprising is that the continue strategy, with restricted numbers of accumulators, is capa- ble of better retrieval performance than the orig- inal cosine method, in which all pages are per- mitted accumulators. It would appear that the mid to low weight terms, while contributing to retrieval effectiveness, should not be permitted to collaborate and select documents that contain none of the more highly weighted terms. 184