NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

WSJ1  - docs: 99111, terms:      , ndim: 250  -> 169 meg db
AP1   - docs: 84930, terms: 78167, ndim: 250  -> 163 meg db
ZIFF1 - docs: 75180, terms: 60565, ndim: 250  -> 135 meg db
FR1   - docs: 26207, terms: 54713, ndim: 250  ->  80 meg db
WSJ2  - docs: 74520, terms:      , ndim: 235  -> 141 meg db
AP2   - docs: 79923, terms: 82997, ndim: 235  -> 153 meg db
ZIFF2 - docs: 56920, terms: 72197, ndim: 235  -> 121 meg db
FR2   - docs:      , terms: 48728, ndim: 235  ->  64 meg db

Used 250 dims for routing and 235 dims for ad hoc queries.

In general, database size will be: (ndocs+nterms)*ndim*4

The totals here are 1288 meg (750000 docs and 585000 terms). If a single database had been used, the total would have been smaller because of term overlap: currently, many of the terms are represented in more than one database; there are only 200000 unique terms.

b. total computer time to build (approximate number of hours)

Four main stages:
1. indexing (extracting keys; calculating wts; etc.)
2. SVD (number of dimensions extracted ranged from 235-310)
   NOTE 1: only 235-250 dims were actually used for retrieval. I don't have timing data for extracting only this smaller number of dimensions, but I'd estimate that the numbers for AP1, ZIFF1 and FR1 could be reduced by about 20%.
   NOTE 2: initial indexing and SVD are typically done on a subset of 50000 docs and uterms
3. various i/o translations (much of this will go away soon)
4. adding new docs to dbase (if sub-sampled for SVD). SVD done on 50000 docs; the remaining docs are indexed and added to the database after the SVD.
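
The size formula quoted earlier in this section, (ndocs+nterms)*ndim*4, can be checked against the per-collection figures. The following is a minimal sketch, not part of the original system. It assumes that the factor 4 is a 4-byte stored value per dimension and that "meg" means 10**6 bytes truncated to a whole number; neither is stated in the source. The function names are hypothetical. Only FR1 and AP2 are used, since their counts are fully listed.

```python
# Sketch: check (ndocs + nterms) * ndim * 4 against the reported sizes.
# Assumptions (not stated in the source): the factor 4 is bytes per stored
# value, and "meg" is 10**6 bytes, truncated to a whole number.
BYTES_PER_VALUE = 4
BYTES_PER_MEG = 10**6


def db_size_bytes(ndocs: int, nterms: int, ndim: int) -> int:
    """Estimated database size in bytes: one ndim-vector per doc and per term."""
    return (ndocs + nterms) * ndim * BYTES_PER_VALUE


def db_size_meg(ndocs: int, nterms: int, ndim: int) -> int:
    """Estimated database size in whole meg (truncated)."""
    return db_size_bytes(ndocs, nterms, ndim) // BYTES_PER_MEG


# name -> (docs, terms, ndim, reported size in meg), taken from the list above
REPORTED = {
    "FR1": (26207, 54713, 250, 80),
    "AP2": (79923, 82997, 235, 153),
}

for name, (ndocs, nterms, ndim, reported) in REPORTED.items():
    estimate = db_size_meg(ndocs, nterms, ndim)
    print(f"{name}: formula gives {estimate} meg, reported {reported} meg")
```

Under these assumptions the formula reproduces both reported sizes. The same arithmetic shows why a single combined database would be smaller: each term vector would be stored once rather than once per collection in which the term occurs.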
all times in MINUTES (SVD run on DEC5000; rest on SPARC2)

DOE1  - index:  49  SVD: 1219  io: 194  add: 591  SUM: 2053 mins
WSJ1  - index: 241  SVD: 1474  io: 174  add: 404  SUM: 2293 mins
AP1   - index: 271  SVD: 1644  io: 214  add: 455  SUM: 2584 mins
ZIFF1 - index: 241  SVD: 1359  io: 156  add: 352  SUM: 2108 mins
FR1   - index: 241  SVD:  939  io: 133  add:   0  SUM: 1313 mins
WSJ2  - index: 427  SVD: 1382  io: 220  add: 461  SUM: 2490 mins
AP2   - index: 338  SVD: 1210  io: 218  add: 273  SUM: 2039 mins
ZIFF2 - index: 260  SVD: 1452  io: 208  add:   0  SUM: 1920 mins
FR2   - index: 187  SVD:  486  io: 105  add:   0  SUM:  778 mins

c. is the process completely automatic? YES

d. brief description of methods used

LSI/SVD analysis of document collection
1. creates raw term-by-doc matrix; transforms entries using log entropy term weightings
2. calculates best "reduced-dimensional" approximation to transformed matrix using SVD. Number of dimensions in the SVD calculations ranged from 235 to 310. BUT, only 235 or 250 were used for the comparisons. Fewer dims could have been calculated, so some reported SVD times are higher than necessary. I'd estimate about 20% reductions in SVD times for AP1, ZIFF1, and FR1.
3. perform various database translations. Current SVD program outputs vectors in a different format and order than we need for the database. It