NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Effective and Efficient Retrieval from Large and Dynamic Document Collections
D. Knaus and P. Schauble
National Institute of Standards and Technology
D. K. Harman, editor
[Figure 2 plot: recall (x-axis, 0-1) vs. precision (y-axis, 0-1) for M1 (lnc.ltn.all) and M0 (ntc.ntn.all).]
Figure 2: Precision-recall graphs of the most effective
method (M1) and of the least effective method (M0).
effective than the "ntn" weighting (10-15%). Restricting
the vocabulary results in a 2-7% lower precision. We
can summarize the experiences as follows:
* The nidf's in the document feature weights have
a bad influence on the retrieval effectiveness. It could
be that for long documents the estimation of the
document lengths is inappropriate when the nidf's are
taken into account.
* Logarithmic feature weighting is more appropriate
for the long TREC documents and queries than linear
feature weighting. Logarithmic feature weighting
avoids an overweighting of features occurring very
frequently within a document.
* Restricting the indexing vocabulary by omitting features
with a high document frequency df has a
noticeable influence on the average precision.
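The first two letters of scheme codes such as "lnc" and "ltn" denote the term-frequency and collection-frequency components in the usual SMART-style notation. As a minimal sketch of the two effects above (the function names and the df cut-off of 15, suggested by the "df15" runs, are illustrative assumptions, not the system's actual code):

```python
import math

def tf_weight(tf, scheme):
    """Term-frequency component: 'l' = logarithmic, 'n' = natural (linear).

    Logarithmic weighting dampens features occurring very frequently
    within a single document, which the text reports works better for
    the long TREC documents than linear weighting.
    """
    if tf == 0:
        return 0.0
    return 1.0 + math.log(tf) if scheme == "l" else float(tf)

def restrict_vocabulary(df, max_df=15):
    """Keep only features whose document frequency does not exceed a
    threshold; features with high df are omitted from the index
    (the cut-off value 15 is an assumption based on the 'df15' runs)."""
    return {feature for feature, freq in df.items() if freq <= max_df}
```

For a feature occurring 8 times in a document, linear weighting gives 8.0 while logarithmic weighting gives roughly 3.1, illustrating how the frequent feature is dampened.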
In what follows, we discuss the influences of the different
parameters on the response time (as shown in Figure 3).
ure 3). Restricting the vocabulary accelerates the query
evaluation (by 9-14%) for the reasons described in Sec-
tion 2. The "ntn" weighting of the query features is
also 9-14% faster than the "ltn" weighting. For document
feature weighting, the "lnc" weighting is 5-10%
slower than the "ltc" weighting. The "ntc" weighting
is even slower. These results can be explained in terms
of the approximation error.
* It is obvious that for the "ltc" weighting the approximation
error (ã_ij − a_ij) · b_i is smaller than for the "lnc"
[Figure 3 plot: response time of the first ranked document (x-axis, 1.3-2.2 sec.) vs. average precision (y-axis, 0.24-0.33) for the methods M1 lnc.ltn.all, M2 lnc.ltn.df15, M3 lnc.ntn.all, M4 lnc.ntn.df15, M5 ltc.ltn.all, M6 ltc.ltn.df15, M7 ltc.ntn.all, M8 ltc.ntn.df15, and M0 ntc.ntn.all.]
Figure 3: Average precisions and response times of the
first ranked document.