NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

A. Total computer time to search (CPU seconds)

1. retrieval time (total CPU seconds between when a query enters the system and a list of document numbers is obtained)

Time = ~50,000 query-doc comparisons/minute when all vectors are pre-loaded. Currently, we compare ALL docs to each query. For ad hoc queries, the time to compare a query to the ~750K docs is ~12 minutes. For routing queries, the time to compare a query (new doc) to the profiles (50 profiles in each of 4 databases) is about .3 sec.

2. ranking time (total CPU seconds to sort document list)

None; it's included in the times given in 1. Currently both comparisons and ranking are done in the same routine.

B. Which methods best describe your machine searching methods?

1. vector space model

C. What factors are included in your ranking?

Hmm, not sure I get this. Similarity between a query and a document is the cosine between the query vector and the document vector. This cosine determines the rank. Term weights are used to determine the location of the query vector. The query is located at the weighted vector sum of its constituent terms.

1. term frequency

log(tf)*(1-entropy) term weight; so there's a tf part

3. other term weights (where do they come from?)

log entropy; weights come from training docs (disk 1) for routing queries, and from both the training and test docs for ad hoc queries

4. semantic closeness (as in semantic net distance)

Sort of; if you think of term vector locations as reflecting semantic associations.
But these locations are automatically derived from the SVD analysis.

8. information theoretic weights

log(tf) * (1-entropy)

IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?

SVDs run on a DEC 5000 w/ ~400 meg; clock is ??? MHz. All else runs on a SPARC 2 w/ 384 meg; clock is 25 MHz (I think).

V. Some systems are research prototypes and others are commercial. To help compare these systems:

1. How much "software engineering" went into the development of your system?

Real hard to say. The system was built as a research prototype to look at many different issues. I'd say about 1-2 person-years, but this is much more than would have been required if specs had been fixed at the beginning.

2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?

The existing tools were used pretty much as-is for TREC, even though they were developed to work with much smaller databases. Also, there are far more parameters and options than we typically use. Almost no effort went into re-engineering for large databases or into handling what we now use as default parameters more efficiently. Time in query construction and retrieval is spent: 1) seeking for vectors in a single large database of term and doc vectors (the database could easily be split); 2) many calculations (scalings of various sorts) are done on the fly; this could be eliminated if one knew that users wanted to retrieve only
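The log(tf)*(1-entropy) weighting mentioned in sections C and 8 can be sketched as follows. This is an illustrative reconstruction, not the original code: the common log-entropy convention uses log(tf + 1) so that absent terms get zero weight, and normalizes the entropy by log(n_docs) so the global factor falls in [0, 1].

```python
import numpy as np

def log_entropy_weights(tf):
    """Sketch of log-entropy term weighting (sections C/8):
    weight(t, d) = log(tf + 1) * (1 - entropy(t)), where entropy(t)
    is the entropy of term t's frequency distribution over the
    (training) documents, normalized to [0, 1] by log(n_docs).
    `tf` is a (terms x docs) raw term-frequency matrix, n_docs > 1."""
    row_totals = np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)
    p = tf / row_totals                      # term's distribution over docs
    safe_p = np.where(p > 0, p, 1.0)         # log(1) = 0 stands in for 0*log(0)
    entropy = -(p * np.log(safe_p)).sum(axis=1) / np.log(tf.shape[1])
    global_weight = 1.0 - entropy            # 1 if concentrated, 0 if uniform
    local_weight = np.log(tf + 1.0)          # the "log(tf)" part
    return local_weight * global_weight[:, None]
```

A term spread evenly over all documents gets weight 0, while a term concentrated in a few documents keeps most of its log(tf + 1) weight, which is why the weights depend on the training docs for routing queries.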
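The pieces described above can be combined into a small end-to-end sketch: term and document locations come from a truncated SVD of the (weighted) term-document matrix, the query is placed at the weighted vector sum of its constituent term vectors (section C), and every document is compared by cosine and sorted in one pass (sections A.1/A.2). The function names, the scaling by singular values, and the choice of k are illustrative assumptions, not details from the original system.

```python
import numpy as np

def build_lsi_space(weighted_td, k):
    """Truncated SVD of a weighted (terms x docs) matrix.
    Rows of term_vecs / doc_vecs are k-dimensional term and document
    locations (here scaled by the singular values)."""
    U, S, Vt = np.linalg.svd(weighted_td, full_matrices=False)
    term_vecs = U[:, :k] * S[:k]
    doc_vecs = Vt[:k].T * S[:k]
    return term_vecs, doc_vecs

def rank_docs(query_terms, query_weights, term_vecs, doc_vecs):
    """Place the query at the weighted sum of its term vectors, then
    compare it against ALL doc vectors by cosine and sort -- comparison
    and ranking done in the same routine, as in A.2."""
    q = (np.asarray(query_weights)[:, None] * term_vecs[query_terms]).sum(axis=0)
    cos = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-cos)              # doc ids, best match first
```

Because every document is scored, the cost grows linearly with collection size, which matches the ~12-minute figure quoted for the ad hoc runs; splitting the single vector database, as suggested in V.2, would parallelize exactly this loop.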