SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Appendix B: System Features
Appendix
National Institute of Standards and Technology
D. K. Harman
V. SYSTEM COMPARISON
NAME | GE | CLARIT | 151 | HNC | NYU | SYRACUSE
...ch ...ing" went ...lent?
GE: Zero
CLARIT: The system used for TREC-2 processing was developed as a university-research prototype. It is engineered for robustness and flexibility, rather than speed. Most of the components of the system are less than two years old. The research-prototype code (essentially all C) is not production-quality.
151: The 151 system was built as a research prototype to look at human interface issues, and designed to work on much smaller databases.
HNC: Several years
NYU: Quite a lot of code rewriting was done to adjust NIST system to handle the large index (8 times larger than without compound terms). I'd guess that about 1-2 person years were spent on various aspects of the system.
SYRACUSE: Not much. It was the first prototype to testing with the current functionalities
__________
...propriate ...s, could ...em be ...Run ...By How
GE: For routing, it could be a bit faster. For retrieval, it is compatible with any inverted indexing strategy.
CLARIT: We anticipate at least an order of magnitude speed improvement in the system within the next six months. This will be possible due to (1) reengineering of the system and (2) the use of optimization utilities sold for the DEC ALPHA platform. (The current OSF compiler does not optimize code appropriately for the ALPHA (64 bit) architecture, with disappointing results).
HNC: Yes, 20%-40% searching entire database. Many orders of magnitude faster with document clustering (had an order of 15 speed-up on a foreign corpus.)
NYU: Base IR system is much better than it was during TREC-1. However, second phase of index building is still slow and fragile.
SYRACUSE: Yes, with careful design of the data structures and elimination of the features added for experimental purposes, the speed can be improved significantly, at least by an order of magnitude.
__________
...atures are ...that would ...ur purpose
GE: This approach is very simple and has no fancy features. Better tokenization, special purpose query handling, proximity, and negation, for example, could help a lot, as would better ranking.
CLARIT: The CLARIT TREC-2 system did not take advantage of several processing options that may have given improved results, including tokenization, sub-lexicon discovery over training sets, and EQ-class discovery for thesaurus terms.
151: Better tokenization, including proper noun identification, phrases, and perhaps some better treatment of "nots." Precision enhancing methods would also help some.
HNC: Word sense disambiguation (already in early development). Document cluster (speed up retrievals).
NYU:
- There is still a lot of room for improvement of NLP programs. A feedback mechanism would be helpful.
- Faster indexing
__________