SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Effective and Efficient Retrieval from Large and Dynamic Document Collections chapter D. Knaus P. Schauble National Institute of Standards and Technology D. K. Harman Effective and Efficient Retrieval from Large and Dynamic Document Collections Daniel Knaus, Peter Scha[OCRerr]uble (knaus schauble)Šinf.ethz.ch Swiss Federal Institute of Technology (ETH) CH-8092 Zu[OCRerr] rich, Switzerland Abstract A new retrieval method together with a new access structure is presented that is aimed at a high update efficiency, a high retrieval efficiency and a high retrieval effectiveness. The access structure consists of signatures and non-inverted descriptions. This access structure can be updated efficiently because the description of a single document is stored in a compact form. The signatures are used to compute approximate retrieval status values first, and the non-invert'ed descriptions are then used to determine the final list of documents ranked by the ex- act retrieval status values. Our basic approach based on the standard tf * idf weighting scheme has been im- proved in in both retrieval effectiveness and retrieval ef- ficiency. On an average, the time for retrieving the top ranked document is clearly below two seconds while the document collection can be updated in 10 msec. (insert- ing, deleting, or modifying a document description). 1 Introduction Information retrieval systems are more and more used to retrieve information from l[OCRerr]rge [OCRerr] d[OCRerr]m[OCRerr]mic data collections, e.g. in office automation or in newscast- ing companies. Since in such environments, the data collections may have to be updated many times every second, the update efficiency is an important evaluation criterion that should be taken into account in addition to the traditional evaluation criteria (retrieval effective- ness, retrieval efficiency, etc.) [10, pp.161], [12, pp.144]. Before introducing our approach, we show that the update efficiency is conflicting with the retrieval effec- tiveness and with the retrieval efficiency. In the case of static data collections, the retrieval effectiveness crite- rion is usually met by means of weighted retriev[OCRerr]l in- cluding relevance feedback [4], [9] whereas the retrieval efficiency criterion is met by precomputing the docu- ment descriptions and by organizing these descriptions as an inverted file [5]. In the case of dynamic data collec- tions, however, a transaction updating the inverted file 163 may block other retrieval and update transactions for an unacceptable long time because adding the description of a new document to an inverted file is time consuming, particularly if the document is long. A possible remedy is to use signatures instead of an inverted file [2]. Sig- natures can be updated efficiently; however, they do not allow document feature weighting and hence, the retrieval effectiveness of the signature based method is inferior to a fully weighted retrieval method [8]. Thus, it is difficult to achieve simultaneously high retrieval effectiveness, high retrieval efficiency, and high update efficiency. In our approach, we focus on the retrieval from large and dynamic data collections where we face the prob- lem of achieving simultaneously high retrieval effective- ness, high retrieval efficiency, and high update efficiency. Addressing this problem is justified by the need for ap- propriate retrieval capabilities in dynamic environments such as in office automation or in newscasting compa- nies. Within the SPIDER project [11] we have devel- oped a retrieval method and an access structure which supports fully weighted retrieval and which achieves short response times even when the collection is updated several times every second. In Section 2, we describe our basic approach, in Section 3, we focus on refinements of this basic approach, in Section 4, we present the results, and in Section 5, we draw some conclusions. 2 SPIDER's Query Evaluation Algorithm In this section, we present a retrieval method to- gether with a new access structure facilitating both fast weighted retrieval and efficient updates of the access structure. A preliminary version was presented in [11]. Our access structure consists of 8ig[OCRerr][OCRerr]t[OCRerr]res CL[OCRerr]d ThOTL- imverted descvi[OCRerr]tio[OCRerr]s of the documents. The signatures are used to compute approximate retrieval status values RSV0(q, d[OCRerr]) first and the non-inverted document de- scriptions are then used to determine the exact retrieval status values RSV(q, d[OCRerr]). The trick is that the approx-