NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Effective and Efficient Retrieval from Large and Dynamic Document Collections chapter D. Knaus P. Schauble National Institute of Standards and Technology D. K. Harman exact retrieval status values are computed as necessary. Assume that d7 is the first (i.e. d7 has the highest Rsv0), d13 is the second, and d5 is the third docu- ment as shown in Figure 1 In this case, RSV(q, d7) is computed first and next, RSV(q, d13) is computed. According to Figure 1, RSV(q, d13) is the highest ex- act retrieval status value of those two exact retrieval status values that are known at this moment. Fur- thermore, all unknown exact retrieval status values are less than or equal to RSV0(q, d5) because of (3). Since RSV(q, d13) > RSV0(q, d5), we can conclude that d13 has the highest exact retrieval status value of {Lll docu- ments. In this example, only two exact retrieval status values had to be computed to determine the top ranked document. In practice, more exact retrieval status val- ues have to be computed. The evaluation algorithm is based on an access strvc- tvre that specifies the following functions. d5 i. [OCRerr]d[OCRerr]) : [OCRerr] e d[OCRerr]} midf: tpj I nidf([OCRerr]) size: d5 d; The functions [OCRerr] and [OCRerr] determine the signatures and the non-inverted document descriptions respectively. The functions nidf and size provide scaling and normaliza- tion factors. The function values [OCRerr](d[OCRerr]) and [OCRerr](d5) are determined completely by the document d5 itself. The function values niŁ[OCRerr]f([OCRerr]) and size (d5), however, depend on the entire document collection. Fortunately, the vari- ation of nidf([OCRerr][OCRerr]) is small if the data collection is suf- ficiently large. Thus, nidf([OCRerr]) has to be recomputed only if the domain of the data collection is shifting and size(d5) has to be recomputed if either TLidf was recalcu- lated or d5 was modified. In the next section, we will see how this basic query evaluation scheme can be refined. 3 Refinements The basic evaluation algorithm described in Section 2 achieves an excellent update efficiency because the de- scription of a single document is stored in a compact form which can be updated efficiently. However, the retrieval effectiveness as well as the retrieval efficiency need further refinements in the case of the TREC experi- ment where extremely large queries are matched against a large document collection containing some large doc- uments. To improve the retrieval effective[OCRerr]ess, we adapted the basic evaluation algorithm to the best global weighting scheme achieved by the SMART system at TREC-1 [1]. The query features are "ltn" weighted, i.e. the query weights depend on the logarithm of the feature frequen- cies and on the inverse document frequencies, and they are not cosine normalized. The "lnc" weighted docu- ment features depend on the logarithm of the feature frequencies. They are cosine normalized, but they do not have a document frequency component. The de- tailed definitions of the feature weights are given in Ta- ble 1. In order to achieve a good retrieval efflcie[OCRerr]cv for the TREC collection we have reduced the indexing vocab- ulary in such a way that on one hand the retrieval efficiency is improved and on the other hand, the re- trieval effectiveness is not deteriorated very much. Our indexing vocabulary was not only reduced by apply- ing a stop word list and a word reduction algorithm, but also by eliminating further indexing terms. From another project [3] and from the experiences of the SMART system at TREC-1 [1] we knew that the re- moval of the indexing features having a very high doc- ument frequency does not deteriorate the retrieval ef- fectiveness very much. The reasons why this restriction (4) has a small effect on the retrieval effectiveness is the (5) following. The elimination of the high document fre- (6) quency features changes the retrieval status values very little because these features have a small inverse docu- (7) ment frequency and hence, they contribute little to the retrieval status values. Such an elimination of the high document frequency features is simil& to the applica- tion of a collection dependent stop word list in addition to a general stop word list. Thus, this restriction has only a small effect on the retrieval status values. In what follows, we focus on the retrieval efficiency and why it is improved by this restriction. With a re- duced indexing vocabulary, 1. fewer documents have a positive approximate re- trieval status value, 2. fewer features [OCRerr]j contribute an error (a[OCRerr]Q[OCRerr] - a[OCRerr][OCRerr]) * to the approximate retrieval status values, and 3. fewer false matches (caused by collisions) occur when computing the approximate retrieval status values. Both scanning the signatures and sorting the approxi- mate retrieval status values become faster when fewer approximate retrieval status values are positive. Fur- thermore, the approximate retrieval status values be- come tighter upper bounds when fewer false matches occur and when the sum of the errors ([OCRerr],Q5 - aij) * is reduced. Having tighter upper bounds means that fewer exact retrieval status values have to be computed until the top ranked document is determined. The re- trieval effectiveness and the retrieval efficiency achieved by means of the reduced indexing vocabulary are pre- sented in the next section. 165