NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology

Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan

Tradeoff runs

Run   Weights   Description

STANDARD
  1.  ntc.ntc   (single terms) full 2-pass indexing
  2.  ntc.ntc   (single terms) alternate indexing method making document vectors
STOPWORD
  3.  ntc.ntc   automatic stopword (added 69 terms occurring in 10% of coll)
  4.  ntc.ntc   automatic stopword (added 350 terms occurring in 5% of coll)
  5.  ntc.ntc   automatic stopword (added 1286 terms occurring in 2% of coll)
STEMMING
  6.  ntc.ntc   only plural stemming
  7.  ntc.ntc   no stems
LOCAL/GLOBAL
 *8.  ntc.ntc   local/global (single terms)
  9.  ntc.ntc   local/global (single terms, same thresholds as 2nd official run)
QUERY OPTIMIZATION
 10.  ntc.ntc   query efficiency optimization (15 docs guaranteed good)
PHRASES
 11.  ntc.ntc   phrase dictionary (> 25 times in D1, 158,000 out of 4.7 million)
*12.  ntc.ntc   local/global (phrases)
OTHER WEIGHTS
 13.  nnc.ntc   (single terms)
 14.  lnc.ltc   (single terms)
 15.  lnc.ltc   (phrases)

In the tables below, " repeats the entry above and [?] marks an entry that is not legible in the source.

Indexing cost

Run   Doc Indexing   Query Indexing   Inverted File   Other File
      Time (hours)   Time (seconds)   Size (Mbytes)   Size (Mbytes)
  1.  4.5/4.9        2.3 (13.6)       667             100
  2.  4.7/0.7        2.7              [?]             790
  3.  4.3/4.6        3.2 (13.1)       624             100
  4.  4.0/4.3        3.0 (12.8)       528             100
  5.  3.7/3.9        2.8              381             100
  6.  4.3/5.0        2.7              724             98
  7.  4.2/4.7        1.6              752             98
 *8.  4.7/0.7        2.7              667             790
  9.  "              "                "               "
 10.  "              "                "               "
 11.  7.5/8.0        3.8              892             104
*12.  9.7/0.9        [?]              892             1040
 13.  [?]            [?]              667             89
 14.  [?]            [?]              "               "
 15.  [?]            [?]              892             104

Query indexing times for runs 12-15 appear in the source, in this order, as 2.7, 4.5 (88.5), 4.5, 2.7, 8.1; their assignment to rows is not recoverable.

Retrieval speed and retrieval effectiveness (averaged over 47 queries)

Run   Retrieval Speed,       11-pt   NumRel   Recall/prec
      50 queries (seconds)           Total    at 200
  1.  358                    1813    3114     2614/3313
  2.  "                      "       "        "
  3.  306                    1828    3101     2587/3299
  4.  166                    1750    2978     2524/3168
  5.  78                     1538    2658     2237/2828
  6.  251                    1745    3148     2605/3349
  7.  235                    1709    3101     2545/3299
 *8.  1465                   1783    3150     2636/3351
  9.  "                      1982    3400     2856/3617
 10.  97                     1693    2983     2476/3173
 11.  415                    1903    3298     2814/3509
*12.  2405                   2080    3555     3076/3782
 13.  262**                  1818    3203     2614/3407
 14.  [?]                    2249    3746     3272/3985
 15.  396                    2424    3886     3394/4134

*  indicates official TREC run.
** timing on machine with 128 Mbyte memory.
Query timing numbers in parentheses indicate CPU time using dictionary on disk.