NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Effective and Efficient Retrieval from Large and Dynamic Document Collections chapter D. Knaus P. Schauble National Institute of Standards and Technology D. K. Harman imate retrieval status values provide fairly tight upper boumda for the exact retrieval status values. We will see how we can take advantage of these upper bounds such that only a few exact retrieval status values have to be computed. The retrieval method determining the approximate retrieval status values and the retrieval method deter- mining the exact retrieval status values are given by the functions RSV0 and RSV respectively. Let R[OCRerr]V0 d7 - d13- R V S {[OCRerr]O[OCRerr] .,`Pm-1} -d13 be the indexing vocabulary (e.g. a set of terms) and let d5 - - d7 D := [OCRerr] be the current document collection. We assume that the signatures [OCRerr](d5) consist of w bits where the bit at position p has the value [OCRerr](d5)[p]. Every indexing feature [OCRerr] is assigned a bit position p = h([OCRerr]) by means of a hash function h([OCRerr]). The function h specifies a signature [OCRerr](d5) for every doc- ument d5 by setting the bit at position p iff d[OCRerr] contains a feature [OCRerr] which is hashed to this position. { Ol if[OCRerr][OCRerr] Ed[OCRerr] :h([OCRerr])=p otherwise We now define the approximate retrieval staLus value RSV0(q, d5) and the exact retrieval status value RSV(q,d5) as follows. Figure 1: Computation of the exact retrieval status val- ues in decreasing order of the approximate retrieval sta- tus values. (2). The TLidf([OCRerr])'s are determined by the document frequencies df([OCRerr][OCRerr]). The normalizations of the feature frequency component and the normalizations of the doc- ument frequency component do not affect the retrieval status values because they cancel out when using the cosine measure. Because of these normahzations, the approximate document weights a?5 are always upper bounds of the exact document weights aij, but they do not depend on the feature frequencies ff('p[OCRerr], d5). O<ajj<[OCRerr]a,9j 0 <b[OCRerr] 1 _* RSV0(q,d[OCRerr]) := Id[OCRerr] I a,95 *b[OCRerr] (1) It is easy [OCRerr](d5) = 0. implies r(d5 to show that RSV(q,d5) = 0 if [OCRerr](q) A Furthermore, from the fact that [OCRerr] E d5 =1 and from a,[OCRerr] < a,9[OCRerr] follows 1 RSV(q,d5):=[OCRerr][OCRerr]* [OCRerr] a[OCRerr][OCRerr]*bj (2) RSV(q,d5) < RSV0(q,d5) (3) I d51 [OCRerr][OCRerr]Eq:[OCRerr]Ed[OCRerr] where a,9[OCRerr] denote the approximate weights of document features, aij denote the exact document weights, and b[OCRerr] denote the query weights. The size Id; I of the descri[OCRerr] tion vector d; is defined as usual (see Table 1). Our basic approach consists of a tf * idf weighting scheme, i.e. the query features are "ntn" weighted whereas the document features are "ntc" weighted (see Table 1). The query weights b[OCRerr] depend on the feature frequen- cies ff([OCRerr]i, q) and on the normalized inverse document frequencies nidf(tp[OCRerr]). The exact document weights a,,' depend on the feature frequencies ff([OCRerr]i, d5) and on the nidf([OCRerr]). In addition, they (aij) are cosine normalized, which is expressed by the division by d4 I in (1) and 164 In what follows, we describe the evaluation algorithm by means of an example. Let q be the user's query. In a first step, bitwise AND-operations are performed with the query signature [OCRerr](q) and the signature [OCRerr](d5) of every document d5 E D. If [OCRerr](q) A [OCRerr](d5) is non-zero then the approximate retrieval status value RSV0(q, d5) is computed. The first step is finished by ordering the documents in decreasing order of the approximate re- trieval status values. Remember that [OCRerr](q) A [OCRerr](d5) = 0 implies that RSV(q, d5) = 0 and hence, d5 can be ig- nored in the sequel. In the second step, the exact retrieval status values are computed in the order that was previously deter- mined by means of the approximate retrieval status val- ues. As shown in the following example, only as many