Recall 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Av.
V1 0.209 0.112 0.034 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.040
V2 0.209 0.112 0.034 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.040
V3 0.191 0.114 0.034 0.018 0.015 0.012 0.000 0.000 0.000 0.000 0.038
V4 0.198 0.115 0.037 0.018 0.015 0.012 0.000 0.000 0.000 0.000 0.039
V5 0.192 0.112 0.034 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.038
V6 0.198 0.111 0.045 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.039
Table 7: All field ranking
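(The Av. column is consistent with averaging the interpolated precision values over the ten recall levels; V1's row, for example, averages to 0.040.) As a sketch of that standard computation, assuming a ranked document list and a set of judged-relevant documents, with illustrative names:

    def recall_level_precisions(ranking, relevant):
        # ranking: document ids in retrieval order.
        # relevant: set of ids judged relevant for the query.
        total = len(relevant)
        hits = 0
        points = []  # (recall, precision) after each relevant doc found
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / total, hits / rank))
        # Interpolated precision at recall 10%, 20%, ..., 100%: the best
        # precision achieved at any rank whose recall meets the level.
        levels = [i / 10 for i in range(1, 11)]
        precs = [max((p for r, p in points if r >= lvl), default=0.0)
                 for lvl in levels]
        return precs, sum(precs) / len(precs)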
Using this information, we tried to combine the various fields in "creative" ways. In Table 8 we show the results for two combined vectors, v7 and v8, described by
v7 = Vt + 2Vd + 3V[?] + 0.[?]Vf + 3Vc + 2Vp
v8 = Vt + Vd + V[?] + 0.[?]Vf + 3V[?] + 3V[?]
Recall 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Av.
V7 0.208 0.112 0.034 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.039
V8 0.205 0.112 0.033 0.014 0.015 0.012 0.000 0.000 0.000 0.000 0.039
Table 8: Combining the fields
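Operationally, each combined vector is just a weighted sum of per-field similarity scores. A minimal sketch, assuming a per-field score has already been computed for each document; the field keys, and the weights on the terms that are unreadable above, are placeholders rather than the paper's values:

    # Hypothetical weights mirroring the legible parts of v7; the fields
    # marked [?] above are unreadable in the source, so "x" and the 0.5
    # weight on Vf are stand-ins, not the paper's values.
    V7_WEIGHTS = {"t": 1.0, "d": 2.0, "x": 3.0, "f": 0.5, "c": 3.0, "p": 2.0}

    def combined_score(field_scores, weights=V7_WEIGHTS):
        # field_scores: per-field similarity scores for one document,
        # e.g. {"t": 0.31, "d": 0.12, ...}; missing fields score zero.
        return sum(w * field_scores.get(f, 0.0) for f, w in weights.items())

Documents are then ranked by this combined score exactly as with a single-field vector.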
The first thing to note is how similar the results are. Because only a limited number of documents is examined, selected by an imperfect Boolean algorithm, many relevant documents are missed altogether, and the ranking formulas have no opportunity to give them high scores.
An alternative technique for evaluating ranking formulas is to determine precision after fixed
numbers of documents have been examined. This has the advantage that large numbers of
relevant documents not identified by the Boolean algorithm do not flatten out recall/precision
results. Table 9 gives precision after 5, 15, 30, 100, and 200 documents have been examined, for all previous experiments.
Documents 5 15 30 100 200 Av.
Vd 0.374 0.321 0.298 0.223 0.158 0.275
V[?] 0.319 0.295 0.279 0.225 0.159 0.255
Va 0.349 0.316 0.305 0.224 0.169 0.273
V1 0.387 0.352 0.336 0.237 0.166 0.296
V2 0.391 0.352 0.336 0.237 0.166 0.296
V3 0.349 0.316 0.305 0.224 0.169 0.273
V4 0.349 0.305 0.304 0.230 0.171 0.272
V5 0.319 0.295 0.279 0.225 0.159 0.255
V6 0.336 0.304 0.284 0.229 0.164 0.263
V7 0.404 0.367 0.334 0.236 0.166 0.302
V8 0.370 0.350 0.327 0.238 0.165 0.290
Table 9: Comparison of ranking formulas for fixed numbers of documents returned
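The figures in Table 9 are plain precision-at-cutoff values. A minimal sketch of the computation, again assuming a ranked list and a relevance set with illustrative names:

    def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 15, 30, 100, 200)):
        # Precision after examining the first k documents, for each
        # cutoff k used in Table 9.
        return {k: sum(1 for doc in ranking[:k] if doc in relevant) / k
                for k in cutoffs}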