NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

documents, for example. Currently both terms and docs can be retrieved with the same programs and scaling isn't done until we see what the user wants retrieved.

3) all calculations are done in floating point. Could be done with integers.

4) each ad hoc query was compared to EVERY document. This can be speeded up by some document clustering algorithms that we have looked at. This can also be speeded up tremendously by using more than one machine or by using a parallel machine. All vectors are independent, so it's trivial to split query processing.

I'd guess that improvements of a factor of 2-5 could be obtained just by tweaking items 1), 2) and 3). Parallel query matching is the way to go. For example, we got speed-ups of 50-100 times using a MasPar for query storage and processing with no attempt to optimize.

In terms of pre-processing and SVD analyses:

1) about 10% of the time is spent in unnecessary I/O translation (because we've patched together pre-existing tools). Much of this will eventually go away.

2) more than 50% of the time is spent in the SVD. These algorithms get better and faster all the time (the algorithm we now use is about 100 times faster than what we used initially). There are speed-memory tradeoffs in different SVD algorithms, so time can probably be decreased by a factor of 2 or 3 by using more memory. Parallel algorithms will help some, but probably only by a factor of 2 or 3.

These are one-time costs for relatively stable domains. We've found that new items can be added to the existing solutions without redoing the scaling for a while.

Others ???

3. What features is your system missing that it would benefit by if it had them?

Precision would probably be increased by many of the standard things--phrases, proper noun identification, tokenizer (for dates, phone numbers, addresses, etc.), and some better handling of negation and union. Some form of literal string matching might be useful to use in combination with LSI for some types of queries.

Others ???
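
The earlier remark that all vectors are independent, so query processing is trivial to split, can be illustrated with a small sketch. This is not the system's code: the function names, the cosine measure, and the toy two-dimensional vectors are assumptions made for illustration, and the threads only stand in for the separate machines or processors mentioned in the text (in CPython they would not give a real speed-up for this kind of arithmetic).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def score_chunk(query, chunk):
    """Score one slice of the collection; chunk is a list of (doc_id, vector)."""
    return [(doc_id, cosine(query, vec)) for doc_id, vec in chunk]


def split(items, n_parts):
    """Cut the collection into at most n_parts contiguous slices."""
    size = max(1, math.ceil(len(items) / n_parts))
    return [items[i:i + size] for i in range(0, len(items), size)]


def rank(query, docs, n_workers=2):
    """Compare the query to every document, one slice per worker, then merge."""
    chunks = split(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda c: score_chunk(query, c), chunks))
    scored = [pair for part in partial for pair in part]
    return sorted(scored, key=lambda p: p[1], reverse=True)


# Hypothetical toy collection: four documents in a 2-dimensional space.
docs = [
    ("d1", [1.0, 0.0]),
    ("d2", [0.6, 0.8]),
    ("d3", [0.0, 1.0]),
    ("d4", [-1.0, 0.2]),
]
ranking = rank([1.0, 0.1], docs)
```

Because no slice needs anything from another slice, the only shared step is the final merge of the scored lists, which is why the split costs so little to set up.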
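
The statement that new items can be added to the existing solutions without redoing the scaling is not spelled out in the text. One common way to do this with an SVD-based representation is usually called folding in: the new document's term vector is projected onto the existing term vectors and divided by the existing singular values, and nothing in the stored solution is recomputed. The sketch below assumes that reading; the function name `fold_in` and the small matrices are hypothetical.

```python
def fold_in(doc_terms, U, S):
    """Project a new term vector into an existing k-dimensional space.

    doc_terms: term weights for the new document, one per term.
    U: existing term vectors, one row per term, each row of length k.
    S: the k existing singular values.
    Returns the new document's k coordinates; U and S are left unchanged.
    """
    k = len(S)
    n_terms = len(U)
    return [
        sum(doc_terms[t] * U[t][j] for t in range(n_terms)) / S[j]
        for j in range(k)
    ]


# Hypothetical existing solution: 3 terms, k = 2, orthonormal columns in U.
U = [
    [0.6, 0.0],
    [0.8, 0.0],
    [0.0, 1.0],
]
S = [2.0, 1.0]

# A new document that uses the first two terms once each.
new_doc = fold_in([1.0, 1.0, 0.0], U, S)
```

A folded-in document is placed using only what the existing dimensions already capture, which fits the source's caveat that this works "for a while" before the scaling has to be redone.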