SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
chapter
D. Knaus
P. Schauble
National Institute of Standards and Technology
D. K. Harman
exact retrieval status values are computed as necessary.
Assume that d7 is the first (i.e. d7 has the highest
Rsv0), d13 is the second, and d5 is the third docu-
ment as shown in Figure 1 In this case, RSV(q, d7)
is computed first and next, RSV(q, d13) is computed.
According to Figure 1, RSV(q, d13) is the highest ex-
act retrieval status value of those two exact retrieval
status values that are known at this moment. Fur-
thermore, all unknown exact retrieval status values are
less than or equal to RSV0(q, d5) because of (3). Since
RSV(q, d13) > RSV0(q, d5), we can conclude that d13
has the highest exact retrieval status value of {Lll docu-
ments. In this example, only two exact retrieval status
values had to be computed to determine the top ranked
document. In practice, more exact retrieval status val-
ues have to be computed.
The evaluation algorithm is based on an access strvc-
tvre that specifies the following functions.
d5 i. [OCRerr]d[OCRerr])
: [OCRerr] e d[OCRerr]}
midf: tpj I nidf([OCRerr])
size: d5 d;
The functions [OCRerr] and [OCRerr] determine the signatures and the
non-inverted document descriptions respectively. The
functions nidf and size provide scaling and normaliza-
tion factors. The function values [OCRerr](d[OCRerr]) and [OCRerr](d5) are
determined completely by the document d5 itself. The
function values niŁ[OCRerr]f([OCRerr]) and size (d5), however, depend
on the entire document collection. Fortunately, the vari-
ation of nidf([OCRerr][OCRerr]) is small if the data collection is suf-
ficiently large. Thus, nidf([OCRerr]) has to be recomputed
only if the domain of the data collection is shifting and
size(d5) has to be recomputed if either TLidf was recalcu-
lated or d5 was modified. In the next section, we will see
how this basic query evaluation scheme can be refined.
3 Refinements
The basic evaluation algorithm described in Section 2
achieves an excellent update efficiency because the de-
scription of a single document is stored in a compact
form which can be updated efficiently. However, the
retrieval effectiveness as well as the retrieval efficiency
need further refinements in the case of the TREC experi-
ment where extremely large queries are matched against
a large document collection containing some large doc-
uments.
To improve the retrieval effective[OCRerr]ess, we adapted the
basic evaluation algorithm to the best global weighting
scheme achieved by the SMART system at TREC-1 [1].
The query features are "ltn" weighted, i.e. the query
weights depend on the logarithm of the feature frequen-
cies and on the inverse document frequencies, and they
are not cosine normalized. The "lnc" weighted docu-
ment features depend on the logarithm of the feature
frequencies. They are cosine normalized, but they do
not have a document frequency component. The de-
tailed definitions of the feature weights are given in Ta-
ble 1.
In order to achieve a good retrieval efflcie[OCRerr]cv for the
TREC collection we have reduced the indexing vocab-
ulary in such a way that on one hand the retrieval
efficiency is improved and on the other hand, the re-
trieval effectiveness is not deteriorated very much. Our
indexing vocabulary was not only reduced by apply-
ing a stop word list and a word reduction algorithm,
but also by eliminating further indexing terms. From
another project [3] and from the experiences of the
SMART system at TREC-1 [1] we knew that the re-
moval of the indexing features having a very high doc-
ument frequency does not deteriorate the retrieval ef-
fectiveness very much. The reasons why this restriction
(4) has a small effect on the retrieval effectiveness is the
(5) following. The elimination of the high document fre-
(6) quency features changes the retrieval status values very
little because these features have a small inverse docu-
(7) ment frequency and hence, they contribute little to the
retrieval status values. Such an elimination of the high
document frequency features is simil& to the applica-
tion of a collection dependent stop word list in addition
to a general stop word list. Thus, this restriction has
only a small effect on the retrieval status values.
In what follows, we focus on the retrieval efficiency
and why it is improved by this restriction. With a re-
duced indexing vocabulary,
1. fewer documents have a positive approximate re-
trieval status value,
2. fewer features [OCRerr]j contribute an error (a[OCRerr]Q[OCRerr] - a[OCRerr][OCRerr]) *
to the approximate retrieval status values, and
3. fewer false matches (caused by collisions) occur
when computing the approximate retrieval status
values.
Both scanning the signatures and sorting the approxi-
mate retrieval status values become faster when fewer
approximate retrieval status values are positive. Fur-
thermore, the approximate retrieval status values be-
come tighter upper bounds when fewer false matches
occur and when the sum of the errors ([OCRerr],Q5 - aij) *
is reduced. Having tighter upper bounds means that
fewer exact retrieval status values have to be computed
until the top ranked document is determined. The re-
trieval effectiveness and the retrieval efficiency achieved
by means of the reduced indexing vocabulary are pre-
sented in the next section.
165