SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
chapter
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
2.2 Management of accumulators
An effective way to rank documents against a
query is with the cosine measure, defined by the
function
Wq,t Wd,t
Gd [OCRerr]q2,t[OCRerr][OCRerr]2,t'
where q is the query, d is the document, and W[OCRerr],t
is the weight of word i in document or query x.
We assume that Wd = V(Z[OCRerr] wd2,t) is the length
of document d, and that the top r answers are to
be retrieved and returned to the user.
In our experiments term weights were calcu-
lated using the IDE rule Wd,t = fd,t log(N/ft),
where fd,t is the number of appearances of term
tin document d, N is the number of documents
in the collection, and ft is the number of docu-
ments that contain term t. Harman [3] gives a
summary of ranking techniques and a discussion
of the cosine measure and IDE weighting rule.
To support the cosine measure in an inverted
file text database system, each inverted file entry
contains a sequence of (d, fd,t) pairs for some term
i. The value ft, the number of documents con-
taining t, is stored in the lexicon, and from these
two the cosine contribution fd,t (log(N/ft))2 of
term tin document d can be calculated. A struc-
ture of accumulators is used to collect these con-
tributions. As the inverted file entry for each
query term is processed, the accumulator value
for each document number in the inverted file
entry is updated by adding in each partial co-
sine value as it is calculated, so that by the time
all inverted file entries have been processed the
accumulator for document d contains the value
fd,t (log(N/ft))2 = Wqt Wd,t.
The number of accumulators required to pro-
cess a query can be large. Queries were created
from TREC topics 51-100 by simply extracting
and stemming all of the alphanumeric strings,
and stopping 601 closed-class words. The gen-
erated queries had an average of over 40 terms,
and each resulted in about 75% of the pages
of TREC having a non-zero accumulator value.
With this ratio the most memory-efficient accu-
mulator structure is an array. But at 32 or 64
bits per accumulator, this structure is formidably
large, and dominates the retrieval-time mem-
ory requirements. Even initialising it is time-
consuming.
1. Order the words in the query from highest
weight to lowest.
2. Set A $; A is the current set of accumula-
tors.
3. For each word tin the query,
(a)
(b)
Retrieve It, the inverted file entry for t.
For each (d, [OCRerr] pair in It,
i. If the accumulator Ad E A, calcu-
late Ad Ad + wq,t Wd,t.
ji. Otherwise, set A A+{Ad}, cal-
culate Ad Wq,t Wd,t.
(c) If Al >[OCRerr] L, go to step 4.
4. For each document d such that Ad C A, cal-
culate Cd Ad/Wd.
5. Identify the r highest values of Cd using a
heap.
Figure 2: Quit algorithm for computing cosine
using approximately L accumulators
A simple strategy for restricting the number of
accumulators is to order query terms by decreas-
ing weight, and only process terms until some
designated termination condition is met. One
such condition is to impose an a priori bound L
on the number of non-zero accumulators, and to
stop processing terms when this bound is reached.
Such a ranking process, which we call quit, is il-
lustrated in Figure 2. Other possible termination
conditions would be to place a limit on the num-
ber of terms considered or on the total number
of pointers decoded; or to place an upper bound
on the term frequency ft, and only process terms
that appear in fewer than x% of the documents,
for some predetermined value x. Buckley and Le-
wit [1]; Lucarella [8]; and Wong and Lee [13] have
also considered various stopping rul[OCRerr]s, and derive
conditions under which inverted file entries can be
discarded without affecting the r top documents
(although their order within the ranking may be
different). Here we are prepared to allow approx-
imate as well as exact rules, acknowledging that
the cosine measure is itself a heuristic.
Quitting has the advantage of only process-
ing the inverted file entries of a subset of the
query terms, and hence of faster ranking, but
at the possible expense of poor retrieval perfor-
mance, depending upon how discriminating the
low weighted terms are.
An alternative is to continue the processing of
183