SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
WSJ1  - docs: 99111, terms:      , ndim: 250  (-> 169 meg db)
AP1   - docs: 84930, terms: 78167, ndim: 250  (-> 163 meg db)
ZIFF1 - docs: 75180, terms: 60565, ndim: 250  (-> 135 meg db)
FR1   - docs: 26207, terms: 54713, ndim: 250  (-> 80 meg db)
WSJ2  - docs: 74520, terms:      , ndim: 235  (-> 141 meg db)
AP2   - docs: 79923, terms: 82997, ndim: 235  (-> 153 meg db)
ZIFF2 - docs: 56920, terms: 72197, ndim: 235  (-> 121 meg db)
FR2   - docs:      , terms: 48728, ndim: 235  (-> 64 meg db)
Used 250 dims for routing and 235 dims for ad hoc queries
In general, database size will be: (ndocs+nterms)*ndim*4
The totals here are 1288 meg (750000 docs and 585000 terms).
If a single database had been used, the total would have been smaller
because of term overlap--currently, many of the terms are represented in
more than one database; there are only 200000 unique terms.
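
The size formula above can be checked with a few lines of Python. This is a minimal sketch, not part of the original system: the function name lsi_db_bytes is hypothetical, the factor of 4 is assumed to be bytes per stored value, and "meg" is assumed to mean 10^6 bytes. The AP2 and FR1 figures are taken from the table above.

```python
def lsi_db_bytes(ndocs, nterms, ndim, bytes_per_value=4):
    """Apply the stated formula: one ndim-vector per document and per term.

    bytes_per_value=4 is an assumption about what the factor of 4 means.
    """
    return (ndocs + nterms) * ndim * bytes_per_value


# (ndocs, nterms, ndim) as listed for two of the collections
collections = {
    "AP2": (79923, 82997, 235),
    "FR1": (26207, 54713, 250),
}

for name, (ndocs, nterms, ndim) in collections.items():
    size = lsi_db_bytes(ndocs, nterms, ndim)
    # Report in units of 10^6 bytes (assumed meaning of "meg")
    print(f"{name}: {size / 1e6:.1f} meg")
```

Under these assumptions AP2 comes to about 153.1 meg and FR1 to about 80.9 meg, in line with the 153 and 80 meg listed for those collections.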
b. total computer time to build (approximate number of hours)
Four main stages:
1. indexing (extracting keys; calculating wts; etc.)
2. SVD (number of dimensions extracted ranged from 235-310)
NOTE 1: only 235-250 dims were actually used for retrieval. I don't have
timing data for extracting only this smaller number of dimensions,
but I'd estimate that the numbers for AP1, ZIFF1 and FR1 could
be reduced by about 20%.
NOTE 2: initial indexing and SVD are typically done on a subset of 50000
docs and uterms
3. various i/o translations (much of this will go away soon)
4. adding new docs to dbase (if sub-sampled for SVD).
SVD done on 50000 docs; the remaining docs are indexed and added to the
database after the SVD.
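
The stages above can be sketched on toy data. This is an illustration under stated assumptions, not the code that was run: the log-entropy weighting is the one named in the method description in section d, the dense numpy.linalg.svd call stands in for whatever SVD program was actually used (a collection of this size would need a sparse solver), and the way the remaining documents are added is assumed to be the standard LSI folding-in projection, which the text does not confirm. All function names and the toy matrices are hypothetical.

```python
import numpy as np


def entropy_global_weights(counts):
    """Global entropy weight per term: 1 + sum_j p_ij*log(p_ij) / log(n_docs)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    totals = counts.sum(axis=1, keepdims=True)
    # p[i, j]: fraction of term i's occurrences that fall in document j
    p = counts / np.where(totals > 0, totals, 1.0)
    plogp = p * np.log(np.where(p > 0, p, 1.0))
    return 1.0 + plogp.sum(axis=1) / np.log(n_docs)


def weight(counts, global_w):
    """Log-entropy weighting: log(1 + tf) scaled by each term's global weight."""
    return np.log1p(np.asarray(counts, dtype=float)) * global_w[:, None]


def build_lsi(sample_counts, ndim):
    """Weight the sampled term-by-doc counts and keep the top ndim SVD factors."""
    global_w = entropy_global_weights(sample_counts)
    u, s, vt = np.linalg.svd(weight(sample_counts, global_w), full_matrices=False)
    # Returns: term weights, term vectors U_k, singular values, document vectors V_k
    return global_w, u[:, :ndim], s[:ndim], vt[:ndim].T


def fold_in(new_counts, global_w, u_k, s_k):
    """Project documents outside the SVD sample: d_hat = d^T U_k S_k^-1 (assumed method)."""
    return (weight(new_counts, global_w).T @ u_k) / s_k


# Toy data (hypothetical): 5 terms, 4 sampled docs, 2 docs added afterwards.
sample = np.array([[2, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 3, 0, 1],
                   [0, 0, 2, 2],
                   [1, 0, 0, 1]])
later = np.array([[1, 0],
                  [0, 2],
                  [0, 1],
                  [3, 0],
                  [0, 0]])

global_w, u_k, s_k, sample_vecs = build_lsi(sample, ndim=2)
later_vecs = fold_in(later, global_w, u_k, s_k)
print(sample_vecs.shape, later_vecs.shape)
```

Folding a sampled document back in reproduces its SVD document vector, which is why the projection is a natural way to place the remaining documents in the same space without redoing the SVD. Note that the global term weights are computed from the sample only and reused for the added documents.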
all times in MINUTES (SVD run on DEC5000; rest on SPARC2)
DOE1  - index:  49  SVD: 1219  io: 194  add: 591  SUM: 2053 mins
WSJ1  - index: 241  SVD: 1474  io: 174  add: 404  SUM: 2293 mins
AP1   - index: 271  SVD: 1644  io: 214  add: 455  SUM: 2584 mins
ZIFF1 - index: 241  SVD: 1359  io: 156  add: 352  SUM: 2108 mins
FR1   - index: 241  SVD:  939  io: 133  add:   0  SUM: 1313 mins
WSJ2  - index: 427  SVD: 1382  io: 220  add: 461  SUM: 2490 mins
AP2   - index: 338  SVD: 1210  io: 218  add: 273  SUM: 2039 mins
ZIFF2 - index: 260  SVD: 1452  io: 208  add:   0  SUM: 1920 mins
FR2   - index: 187  SVD:  486  io: 105  add:   0  SUM:  778 mins
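
Since the question asks for hours and the figures are reported in minutes, the conversion can be done as below. This is only a convenience sketch: the variable names are hypothetical, the numbers are the per-stage times listed above, and the sum mixes time on two different machines, so it is a rough figure.

```python
# Per-collection stage times in minutes: (index, SVD, io, add)
stage_minutes = {
    "DOE1": (49, 1219, 194, 591),
    "WSJ1": (241, 1474, 174, 404),
    "AP1": (271, 1644, 214, 455),
    "ZIFF1": (241, 1359, 156, 352),
    "FR1": (241, 939, 133, 0),
    "WSJ2": (427, 1382, 220, 461),
    "AP2": (338, 1210, 218, 273),
    "ZIFF2": (260, 1452, 208, 0),
    "FR2": (187, 486, 105, 0),
}

# Sum the four stages for each collection, then over all collections
totals = {name: sum(stages) for name, stages in stage_minutes.items()}
grand_total_minutes = sum(totals.values())
grand_total_hours = grand_total_minutes / 60

for name, minutes in totals.items():
    print(f"{name}: {minutes} mins")
print(f"all collections: {grand_total_minutes} mins = {grand_total_hours:.0f} hours")
```

The per-collection sums reproduce the SUM column, and the grand total is 17578 minutes, or roughly 293 hours.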
c. is the process completely automatic? YES
d. brief description of methods used
LSI/SVD analysis of document collection
1. creates raw term-by-doc matrix; transforms entries using log entropy
term weightings
2. calculates best "reduced-dimensional" approximation to transformed
matrix using SVD. Number of dimensions in the SVD calculations ranged
from 235 to 310. BUT, only 235 or 250 were used for the comparisons.
Fewer dims could have been calculated, so some reported SVD times are
higher than necessary. I'd estimate about 20% reductions in SVD times for
AP1, ZIFF1, and FR1.
3. perform various database translations. Current SVD program outputs
vectors in a different format and order than we need for the database. It