SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
chapter
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
1. Order the words iu the query from highest
weight to lowest.
2. Set A
3. For each word tin the query,
(a) Retrieve It.
(b) For each [OCRerr]d, fd,t) pair in It[OCRerr]
i. If Ad E A, calculate Ad Ad +
Wq,t Wd,t.
ji. Otherwise, set A A + {Ad}, cal-
culate Ad Wq,t Wd,t.
(c) If IAI > L, go to step 4. 20 -
4. For each remaining word tin the query,
(a) Retrieve It.
(b) For each d such that Ad E A,
if (d,fd,t[OCRerr] E I[OCRerr], calculate Ad Ad+
Wq,t Wd,t.
5. For each document d such that Ad [OCRerr] A, cal-
culate Cd Ad/Wd.
(0-
0
0
15 -
sured as an lipt average precision, as a function
of k = Al, the number of accumulators actually
used. The small numbers beside the continue
curve show the average number of terms pro-
cessed in phase one of each query. For example,
only 8.2 terms are needed to generate 27,000 ac-
cumulators. The difference between quit and con-
tinue is marked, and, perhaps surprisingly, even
the mid to low weight terms appear to contribute
to the effectiveness of the cosine rule. Ignoring
them leads to significantly poorer retrieval.
6. Identify the r highest values of Cd.
Figure 3: Con[OCRerr]inue algorithm for using approxi-
mately L accumulators
10
continue
D quit
1000 10000 100000 1000000
inverted file entries after the bound on the num-
ber of accumulators is reached, but allow no new
documents into the accumulator set. This con-
tinue algorithm is illustrated in Figure 3. The
algorithm has two distinct phases. In the first
phase, accumulators are added freely, as in the
qui[OCRerr] algorithm. In the second stage, existing ac-
cumulator values are updated but no new accu-
mulators are added. Both qui[OCRerr] and continue gen-
erate the same set of approximately L candidate
answers, but in a different permutation, so when
the top r documents are extracted from this set
and returned, different retrieval effectiveness can
be expected, particularly if r ĞL.
A similar strategy to that of continue is de-
scribed by flarman and Candela [3, 4], although
their motivation is somewhat different-they de-
sire a small number of accumulators to reduce
the sorting time required for a total ranking of
the collection, whereas we assume that only a
small fraction of the collection is to be presented.
In this latter case, to find the top r documents
a heap is the appropriate data structure, and
O(N + r log N) time is required, a small fraction
of query processing time.
Figure 4 shows retrieval effectiveness, mea-
Number of accumulators
Figure 4: Effectiveness of quit and continue
In these experiments direct assessment of the
relevance of pages was not possible because the
relevance judgements were for whole documents.
To measure effectiveness, it was assumed that if a
document was relevant then each page from it was
relevant, but only the most highly ranked page
from each document was returned and counted.
We believe this method to be fair, since one way
in which pages might be used would be to rank on
pages but return whole documents, with perhaps
some highlighting of the top-ranked pages. Other
strategies for using parts of documents to select
whole documents are discussed in Section 3.
Also surprising is that the continue strategy,
with restricted numbers of accumulators, is capa-
ble of better retrieval performance than the orig-
inal cosine method, in which all pages are per-
mitted accumulators. It would appear that the
mid to low weight terms, while contributing to
retrieval effectiveness, should not be permitted
to collaborate and select documents that contain
none of the more highly weighted terms.
184