NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
D. Knaus and P. Schauble
National Institute of Standards and Technology
D. K. Harman, editor
[Figure 2 plot: recall (x-axis, 0-1) vs. precision (y-axis, 0-1) for M1 (lnc.ltn.all) and M0 (ntc.ntn.all).]
Figure 2: Precision-recall graphs of the most effective
method (M1) and of the least effective method (M0).
effective than the "ntn" weighting (10-15%). Restricting
the vocabulary results in a 2-7% lower precision. We
can summarize the experiences as follows:
* The nidf's in the document feature weights have
a bad influence on the retrieval effectiveness. It could
be that for long documents the estimation of the
document lengths is inappropriate when the nidf's are
taken into account.
* Logarithmic feature weighting is more appropriate
for the long TREC documents and queries than linear
feature weighting. Logarithmic feature weighting
avoids an overweighting of features occurring very
frequently within a document.
* Restricting the indexing vocabulary by omitting features
with a high document frequency df has a
noticeable influence on the average precision.
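The first two letters of scheme codes such as "lnc" and "ltn" denote the term-frequency and collection-frequency components in the usual SMART-style notation. As a minimal sketch of the two effects above (the function names and the df cut-off of 15, suggested by the "df15" runs, are illustrative assumptions, not the system's actual code):

```python
import math

def tf_weight(tf, scheme):
    """Term-frequency component: 'l' = logarithmic, 'n' = natural (linear).

    Logarithmic weighting dampens features occurring very frequently
    within a single document, which the text reports works better for
    the long TREC documents than linear weighting.
    """
    if tf == 0:
        return 0.0
    return 1.0 + math.log(tf) if scheme == "l" else float(tf)

def restrict_vocabulary(df, max_df=15):
    """Keep only features whose document frequency does not exceed a
    threshold; features with high df are omitted from the index
    (the cut-off value 15 is an assumption based on the 'df15' runs)."""
    return {feature for feature, freq in df.items() if freq <= max_df}
```

For a feature occurring 8 times in a document, linear weighting gives 8.0 while logarithmic weighting gives roughly 3.1, illustrating how the frequent feature is dampened.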
In what follows, we discuss the influences of the different
parameters on the response time (as shown in Figure 3).
ure 3). Restricting the vocabulary accelerates the query
evaluation (by 9-14%) for the reasons described in Sec-
tion 2. The "ntn" weighting of the query features is
also 9-14% faster than the "ltn" weighting. For document
feature weighting, the "lnc" weighting is 5-10%
slower than the "ltc" weighting. The "ntc" weighting
is even slower. These results can be explained in terms
of the approximation error.
* It is obvious that for the "ltc" weighting the approximation
error (ã_ij − a_ij) · b_i is smaller than for the "lnc"
[Figure 3 plot: response time of the first ranked document (x-axis, 1.3-2.2 sec.) vs. average precision (y-axis, 0.24-0.33) for the methods M1 lnc.ltn.all, M2 lnc.ltn.df15, M3 lnc.ntn.all, M4 lnc.ntn.df15, M5 ltc.ltn.all, M6 ltc.ltn.df15, M7 ltc.ntn.all, M8 ltc.ntn.df15, and M0 ntc.ntn.all.]
Figure 3: Average precisions and response times of the
first ranked document.