NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
A. Total computer time to search (cpu seconds)
1. retrieval time (total CPU seconds from when a query enters the system until a list of
document numbers is obtained)
Time = ~50,000 query-doc comparisons/minute when all vectors are pre-loaded.
Currently, we compare ALL docs to each query.
For ad hoc queries, the time to compare a query to the 750K docs is ~12 minutes
For routing queries, the time to compare a query (new doc) to the profiles (50
profiles in each of 4 databases) is about .3 sec
2. ranking time (total cpu seconds to sort document list)
none; it's included in the times given in 1. Currently both comparisons and ranking
are done in the same routine
B. Which methods best describe your machine searching methods?
1. vector space model
C. What factors are included in your ranking?
Hmm, not sure I get this. Similarity between a query and a document is the cosine between
the query vector and the document vector. This cosine determines the rank.
Term weights are used to determine the location of the query vector. The query is located
at the weighted vector sum of its constituent terms.
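The ranking just described can be sketched as follows. The vectors, sizes, and names here are invented for illustration (the real term and document vectors come from the SVD analysis): the query is placed at the weighted sum of its terms' vectors, then every document is scored by cosine and sorted in the same routine, as in the answer to A.2 above.

```python
import numpy as np

# Illustration only: random stand-ins for the SVD-derived vectors.
rng = np.random.default_rng(0)
n_terms, n_docs, k = 5, 4, 3                 # k = number of latent dimensions
term_vecs = rng.normal(size=(n_terms, k))    # one k-dim vector per term
doc_vecs = rng.normal(size=(n_docs, k))      # one k-dim vector per document

def query_vector(term_ids, weights):
    """Locate the query at the weighted vector sum of its constituent terms."""
    return (weights[:, None] * term_vecs[term_ids]).sum(axis=0)

def rank_docs(q):
    """Score every document by cosine with the query, then sort
    (comparison and ranking in one pass)."""
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)                # best-matching documents first
    return order, sims[order]

q = query_vector(np.array([0, 2]), np.array([1.5, 0.8]))
order, sims = rank_docs(q)
```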
1. term frequency: log(tf)*(1-entropy) term weight; so there's a tf part
3. other term weights (where do they come from?)
log entropy; weights come from training docs (disk 1) for routing queries, and from
both the training and test docs for ad hoc queries
4. semantic closeness (as in semantic net distance)
sort of; if you think of term vector locations as reflecting semantic associations. But
these locations are automatically derived from the SVD analysis
8. information theoretic weights: log(tf) * (1-entropy)
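One common reading of the log(tf) * (1-entropy) weighting above can be sketched as follows. The exact formulas are not given here, so two details are assumptions: the entropy is normalized by log of the number of documents (so it lies in [0, 1]), and the local part uses log(tf + 1) so single-occurrence terms keep a nonzero weight.

```python
import numpy as np

def log_entropy_weights(tf):
    """tf: (n_terms, n_docs) raw term-frequency matrix.
    Returns log(tf + 1) * (1 - normalized entropy) weights.
    Assumed normalization: entropy divided by log(n_docs)."""
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)            # global frequency per term
    # p[i, j] = share of term i's occurrences that fall in doc j
    p = np.divide(tf, gf, out=np.zeros_like(tf, dtype=float), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1) / np.log(n_docs)  # 1 = evenly spread term
    local = np.where(tf > 0, np.log(tf + 1.0), 0.0)
    return local * (1.0 - entropy)[:, None]
```

A term spread evenly over all documents gets entropy 1 and hence weight 0 everywhere; a term concentrated in one document gets entropy 0 and keeps its full log(tf + 1) weight.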
IV. What machine did you conduct the TREC experiment on?
How much RAM did it have?
What was the clock rate of the CPU?
SVDs run on DEC5000 w/ ~400 meg; clock is ??? MHz
all else run on SPARC 2 w/ 384 meg; clock is 25 MHz (I think)
V. Some systems are research prototypes and others are commercial.
To help compare these systems:
1. How much "software engineering" went into the development of your system?
Real hard. The system was built as a research prototype to look at many different
issues. I'd say about 1-2 person-years, but this is much more than would have been
required if specs had been fixed at the beginning.
2. Given appropriate resources, could your system be made to run faster? By how much
(estimate)?
The existing tools were used pretty much as is for TREC, even though they were
developed to work with much smaller databases. Also, there are far more
parameters and options than we typically use. Almost no effort went into
re-engineering for large databases or to more efficiently handle what we now use as
default parameters.
Time in query construction and retrieval is spent:
1) seeking for vectors in a single large database of term and doc vectors.
The database could easily be split.
2) many calculations (scalings of various sorts) are done on the fly. This
could be eliminated if one knew that users wanted to retrieve only
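The second point can be illustrated under one assumption: that the on-the-fly scalings include the cosine normalization of the document vectors. If each document vector is scaled to unit length once, ahead of time, the query-time cosine reduces to a plain dot product. All names and sizes below are invented for the sketch.

```python
import numpy as np

# Hypothetical stand-ins for the stored document vectors.
rng = np.random.default_rng(1)
doc_vecs = rng.normal(size=(1000, 50))     # 1000 docs, 50 latent dims

# One-time preprocessing: scale every document vector to unit length.
unit_docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def cosine_on_the_fly(q):
    """Document norms recomputed for every query (scaling on the fly)."""
    return doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))

def cosine_precomputed(q):
    """Same scores; only the query norm is computed at query time."""
    return unit_docs @ (q / np.linalg.norm(q))

q = rng.normal(size=50)
```

Both routines return identical scores; the second simply moves the per-document work out of the query loop.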