SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
documents, for example. Currently both terms and docs can be
retrieved with the same programs and scaling isn't done until we
see what the user wants retrieved.
3) all calculations are done in floating point. Could be done with integers.
4) each ad hoc query was compared to EVERY document. This can be
speeded up by some document clustering algorithms that we have
looked at. This can also be speeded up tremendously by using more
than one machine or by using a parallel machine. All vectors are
independent, so it's trivial to split query processing.
I'd guess that improvements of a factor of 2-5 could be obtained just by tweaking
items 1), 2) and 3).
Parallel query matching is the way to go. For example, we got speed-ups of 50-100
times using a MasPar for query storage and processing with no attempt to optimize.
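
Because every document vector is scored independently, the collection can be cut into slices and each slice matched against the query separately. The sketch below illustrates that idea only; it is not the code used in the experiments. The names `cosine`, `score_chunk` and `rank_documents` are hypothetical, cosine similarity is assumed as the matching function, and the thread pool merely stands in for the separate machines or processors mentioned above (in CPython it will not by itself give the reported speed-ups).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_chunk(query, chunk):
    """Score one slice of the collection; chunk is a list of (doc_id, vector) pairs."""
    return [(doc_id, cosine(query, vec)) for doc_id, vec in chunk]


def rank_documents(query, docs, workers=4):
    """Compare the query to EVERY document, splitting the work into independent slices."""
    items = list(docs.items())
    size = max(1, math.ceil(len(items) / workers))
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    # Each slice could equally be sent to a different machine or processor;
    # a thread pool is used here only to keep the sketch self-contained.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: score_chunk(query, c), chunks))
    scored = [pair for part in partials for pair in part]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

Merging the partial results is a single sort over (doc_id, score) pairs, so no slice needs to know about any other.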
In terms of pre-processing and SVD analyses:
1) about 10% of the time is spent in unnecessary I/O translation (because
we've patched together pre-existing tools). Much of this will
eventually go away.
2) more than 50% of the time is spent in the SVD. These algorithms get
better and faster all the time (the algorithm we now use is about
100 times faster than what we used initially). There are
speed-memory tradeoffs in different SVD algorithms, so time can
probably be decreased by a factor of 2 or 3 by using more memory.
Parallel algorithms will help some, but probably only by a factor of
2 or 3.
These are one-time costs for relatively stable domains. We've found that new items
can be added to the existing solutions without redoing the scaling for a while.
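
The text does not say how new items are added. A common approach in the LSI literature is "folding-in", where a new document's term vector d is projected into the existing k-dimensional space as d^T U_k S_k^{-1}, leaving the SVD untouched. The sketch below assumes that method; the function name `fold_in` and its argument layout are hypothetical.

```python
def fold_in(doc_terms, term_vectors, singular_values):
    """Project a new document into an existing LSI space without recomputing the SVD.

    doc_terms       -- term weights for the new document, one per vocabulary term
    term_vectors    -- rows of U_k: one k-dimensional vector per vocabulary term
    singular_values -- the k singular values (diagonal of S_k), all non-zero
    Returns the k-dimensional vector d^T U_k S_k^{-1}.
    """
    k = len(singular_values)
    projected = [0.0] * k
    for weight, row in zip(doc_terms, term_vectors):
        if weight == 0:
            continue  # terms absent from the document contribute nothing
        for j in range(k):
            projected[j] += weight * row[j]
    # Dividing by the singular values applies S_k^{-1}.
    return [projected[j] / singular_values[j] for j in range(k)]
```

Folded-in items do not influence the term vectors, which fits the remark that this only works "for a while" before the scaling has to be redone.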
Others ???
3. What features is your system missing that it would benefit by if it had them?
Precision would probably be increased by many of the standard things--phrases,
proper noun identification, tokenizer (for dates, phone numbers, addresses, etc.), and
some better handling of negation and union.
Some form of literal string matching might be useful to use in combination with LSI
for some types of queries.
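
How the two would be combined is left open. One possibility, shown purely as an illustration, is to take an LSI ranking and raise the score of any document whose text contains the literal string. The function `rerank_with_literal`, the additive `boost` and its default value are all assumptions, not part of the system described.

```python
def rerank_with_literal(lsi_ranked, texts, literal, boost=0.5):
    """Re-rank LSI results, favouring documents that contain a literal string.

    lsi_ranked -- list of (doc_id, lsi_score) pairs
    texts      -- mapping from doc_id to the raw document text
    literal    -- the exact string to look for (matched case-insensitively)
    boost      -- hypothetical amount added to the score of matching documents
    """
    needle = literal.lower()
    rescored = []
    for doc_id, score in lsi_ranked:
        hit = needle in texts.get(doc_id, "").lower()
        rescored.append((doc_id, score + boost if hit else score))
    return sorted(rescored, key=lambda p: p[1], reverse=True)
```

A strict filter (dropping documents without the string) would be the other obvious variant; the additive boost keeps LSI's ability to retrieve documents that never mention the literal term.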
Others ???