NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel

National Institute of Standards and Technology
Donna K. Harman (editor)

Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
V1      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
V2      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
V3      0.191  0.114  0.034  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.038
V4      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039
V5      0.192  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
V6      0.198  0.111  0.045  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 7: All field ranking

Using this information, we tried to combine the various fields in "creative" ways. In Table 8 we show the results for the combined vectors described by

    v7 = Vt + 2Vd + 3V[?] + 0.5Vf + 3Vc + 2Vp
    v8 = Vt + Vd + V[?] + 0.5Vf + 3V[?] + 3V[?]

Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
V7      0.208  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039
V8      0.205  0.112  0.033  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 8: Combining the fields

The first thing to note is how similar the results are. Since only a limited number of documents are being examined, selected by an imperfect Boolean algorithm, many relevant documents are missed altogether, so the ranking formulas have no opportunity to give them high scores. An alternative technique for evaluating ranking formulas is to measure precision after fixed numbers of documents have been examined. This has the advantage that the large number of relevant documents not identified by the Boolean algorithm does not flatten out the recall/precision results. Table 9 gives results at several fixed cutoffs for all of the previous experiments.

Documents     5      15      30     100     200     Av.
Vd        0.374   0.321   0.298   0.223   0.158   0.275
V[?]      0.319   0.295   0.279   0.225   0.159   0.255
Va        0.349   0.316   0.305   0.224   0.169   0.273
V1        0.387   0.352   0.336   0.237   0.166   0.296
V2        0.391   0.352   0.336   0.237   0.166   0.296
V3        0.349   0.316   0.305   0.224   0.169   0.273
V4        0.349   0.305   0.304   0.230   0.171   0.272
V5        0.319   0.295   0.279   0.225   0.159   0.255
V6        0.336   0.304   0.284   0.229   0.164   0.263
V7        0.404   0.367   0.334   0.236   0.166   0.302
V8        0.370   0.350   0.327   0.238   0.165   0.290

Table 9: Comparison of ranking formulas for fixed numbers of documents returned
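
The combined vectors v7 and v8 are weighted sums of the per-field similarity scores. As a rough illustration of that idea, the Python sketch below combines per-field scores with fixed weights; the field names, weights, and score values are assumptions chosen for illustration only, not the coefficients actually used in the runs above.

    # Illustrative sketch: weighted combination of per-field similarity scores,
    # in the spirit of the v7/v8 formulas. Field names and weights are hypothetical.

    def combined_score(field_scores, weights):
        """Weighted sum of per-field query-document similarities."""
        return sum(weights.get(field, 0.0) * score
                   for field, score in field_scores.items())

    # One document's per-field similarities against a query (made-up numbers).
    doc = {"title": 0.31, "descriptors": 0.12, "text": 0.45, "captions": 0.05}
    weights_v7_like = {"title": 1.0, "descriptors": 2.0, "text": 3.0, "captions": 0.5}

    print(round(combined_score(doc, weights_v7_like), 3))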
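
The cutoff-based evaluation of Table 9 reduces to computing precision after the top k documents have been examined; relevant documents that the Boolean filter never retrieved simply count as misses at every cutoff, rather than flattening the high-recall end of a recall/precision curve. A minimal sketch of that computation follows, assuming the cutoffs used in Table 9; the function name and example data are hypothetical.

    # Illustrative sketch: precision after a fixed number of documents has been
    # examined, the measure reported in Table 9.

    def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 15, 30, 100, 200)):
        """ranking: doc ids in decreasing score order; relevant: set of relevant ids."""
        results = {}
        hits = 0
        for i, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
            if i in cutoffs:
                results[i] = hits / i
        # If fewer than k documents were retrieved, the unfilled slots count as misses.
        for k in cutoffs:
            results.setdefault(k, hits / k)
        return results

    # Example: six retrieved documents, two of them relevant.
    print(precision_at_cutoffs(["d3", "d9", "d1", "d7", "d2", "d5"], {"d9", "d5"}))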