SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
chapter
D. Knaus
P. Schauble
National Institute of Standards and Technology
D. K. Harman
Effective and Efficient Retrieval from
Large and Dynamic Document Collections
Daniel Knaus, Peter Scha[OCRerr]uble
(knaus schauble)Šinf.ethz.ch
Swiss Federal Institute of Technology (ETH)
CH-8092 Zu[OCRerr] rich, Switzerland
Abstract
A new retrieval method together with a new access
structure is presented that is aimed at a high update
efficiency, a high retrieval efficiency and a high retrieval
effectiveness. The access structure consists of signatures
and non-inverted descriptions. This access structure can
be updated efficiently because the description of a single
document is stored in a compact form. The signatures
are used to compute approximate retrieval status values
first, and the non-invert'ed descriptions are then used to
determine the final list of documents ranked by the ex-
act retrieval status values. Our basic approach based
on the standard tf * idf weighting scheme has been im-
proved in in both retrieval effectiveness and retrieval ef-
ficiency. On an average, the time for retrieving the top
ranked document is clearly below two seconds while the
document collection can be updated in 10 msec. (insert-
ing, deleting, or modifying a document description).
1 Introduction
Information retrieval systems are more and more used
to retrieve information from l[OCRerr]rge [OCRerr] d[OCRerr]m[OCRerr]mic data
collections, e.g. in office automation or in newscast-
ing companies. Since in such environments, the data
collections may have to be updated many times every
second, the update efficiency is an important evaluation
criterion that should be taken into account in addition
to the traditional evaluation criteria (retrieval effective-
ness, retrieval efficiency, etc.) [10, pp.161], [12, pp.144].
Before introducing our approach, we show that the
update efficiency is conflicting with the retrieval effec-
tiveness and with the retrieval efficiency. In the case of
static data collections, the retrieval effectiveness crite-
rion is usually met by means of weighted retriev[OCRerr]l in-
cluding relevance feedback [4], [9] whereas the retrieval
efficiency criterion is met by precomputing the docu-
ment descriptions and by organizing these descriptions
as an inverted file [5]. In the case of dynamic data collec-
tions, however, a transaction updating the inverted file
163
may block other retrieval and update transactions for an
unacceptable long time because adding the description
of a new document to an inverted file is time consuming,
particularly if the document is long. A possible remedy
is to use signatures instead of an inverted file [2]. Sig-
natures can be updated efficiently; however, they do
not allow document feature weighting and hence, the
retrieval effectiveness of the signature based method is
inferior to a fully weighted retrieval method [8]. Thus,
it is difficult to achieve simultaneously high retrieval
effectiveness, high retrieval efficiency, and high update
efficiency.
In our approach, we focus on the retrieval from large
and dynamic data collections where we face the prob-
lem of achieving simultaneously high retrieval effective-
ness, high retrieval efficiency, and high update efficiency.
Addressing this problem is justified by the need for ap-
propriate retrieval capabilities in dynamic environments
such as in office automation or in newscasting compa-
nies. Within the SPIDER project [11] we have devel-
oped a retrieval method and an access structure which
supports fully weighted retrieval and which achieves
short response times even when the collection is updated
several times every second. In Section 2, we describe our
basic approach, in Section 3, we focus on refinements of
this basic approach, in Section 4, we present the results,
and in Section 5, we draw some conclusions.
2 SPIDER's Query Evaluation
Algorithm
In this section, we present a retrieval method to-
gether with a new access structure facilitating both fast
weighted retrieval and efficient updates of the access
structure. A preliminary version was presented in [11].
Our access structure consists of 8ig[OCRerr][OCRerr]t[OCRerr]res CL[OCRerr]d ThOTL-
imverted descvi[OCRerr]tio[OCRerr]s of the documents. The signatures
are used to compute approximate retrieval status values
RSV0(q, d[OCRerr]) first and the non-inverted document de-
scriptions are then used to determine the exact retrieval
status values RSV(q, d[OCRerr]). The trick is that the approx-