SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
chapter
D. Knaus
P. Schauble
National Institute of Standards and Technology
D. K. Harman
imate retrieval status values provide fairly tight upper
boumda for the exact retrieval status values. We will see
how we can take advantage of these upper bounds such
that only a few exact retrieval status values have to be
computed.
The retrieval method determining the approximate
retrieval status values and the retrieval method deter-
mining the exact retrieval status values are given by the
functions RSV0 and RSV respectively. Let
R[OCRerr]V0
d7 -
d13-
R V
S
{[OCRerr]O[OCRerr] .,`Pm-1} -d13
be the indexing vocabulary (e.g. a set of terms) and let
d5 -
- d7
D := [OCRerr]
be the current document collection. We assume that
the signatures [OCRerr](d5) consist of w bits where the bit at
position p has the value [OCRerr](d5)[p].
Every indexing feature [OCRerr] is assigned a bit position p =
h([OCRerr]) by means of a hash function h([OCRerr]).
The function h specifies a signature [OCRerr](d5) for every doc-
ument d5 by setting the bit at position p iff d[OCRerr] contains
a feature [OCRerr] which is hashed to this position.
{ Ol
if[OCRerr][OCRerr] Ed[OCRerr] :h([OCRerr])=p
otherwise
We now define the approximate retrieval staLus
value RSV0(q, d5) and the exact retrieval status value
RSV(q,d5) as follows.
Figure 1: Computation of the exact retrieval status val-
ues in decreasing order of the approximate retrieval sta-
tus values.
(2). The TLidf([OCRerr])'s are determined by the document
frequencies df([OCRerr][OCRerr]). The normalizations of the feature
frequency component and the normalizations of the doc-
ument frequency component do not affect the retrieval
status values because they cancel out when using the
cosine measure.
Because of these normahzations, the approximate
document weights a?5 are always upper bounds of the
exact document weights aij, but they do not depend on
the feature frequencies ff('p[OCRerr], d5).
O<ajj<[OCRerr]a,9j
0 <b[OCRerr]
1
_*
RSV0(q,d[OCRerr]) := Id[OCRerr] I
a,95 *b[OCRerr] (1) It is easy
[OCRerr](d5) = 0.
implies r(d5
to show that RSV(q,d5) = 0 if [OCRerr](q) A
Furthermore, from the fact that [OCRerr] E d5
=1 and from a,[OCRerr] < a,9[OCRerr] follows
1
RSV(q,d5):=[OCRerr][OCRerr]* [OCRerr] a[OCRerr][OCRerr]*bj (2)
RSV(q,d5) < RSV0(q,d5) (3)
I d51 [OCRerr][OCRerr]Eq:[OCRerr]Ed[OCRerr]
where a,9[OCRerr] denote the approximate weights of document
features, aij denote the exact document weights, and b[OCRerr]
denote the query weights. The size Id; I of the descri[OCRerr]
tion vector d; is defined as usual (see Table 1).
Our basic approach consists of a tf * idf weighting
scheme, i.e. the query features are "ntn" weighted
whereas the document features are "ntc" weighted (see
Table 1).
The query weights b[OCRerr] depend on the feature frequen-
cies ff([OCRerr]i, q) and on the normalized inverse document
frequencies nidf(tp[OCRerr]). The exact document weights a,,'
depend on the feature frequencies ff([OCRerr]i, d5) and on the
nidf([OCRerr]). In addition, they (aij) are cosine normalized,
which is expressed by the division by d4 I in (1) and
164
In what follows, we describe the evaluation algorithm
by means of an example. Let q be the user's query.
In a first step, bitwise AND-operations are performed
with the query signature [OCRerr](q) and the signature [OCRerr](d5)
of every document d5 E D. If [OCRerr](q) A [OCRerr](d5) is non-zero
then the approximate retrieval status value RSV0(q, d5)
is computed. The first step is finished by ordering the
documents in decreasing order of the approximate re-
trieval status values. Remember that [OCRerr](q) A [OCRerr](d5) = 0
implies that RSV(q, d5) = 0 and hence, d5 can be ig-
nored in the sequel.
In the second step, the exact retrieval status values
are computed in the order that was previously deter-
mined by means of the approximate retrieval status val-
ues. As shown in the following example, only as many