NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Appendix B: System Features Appendix
National Institute of Standards and Technology
D. K. Harman

IA. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- METHODS USED

Two stoplists were used.
System 1: For routing and manual/feedback, 411 stopwords + 58 semi-stopwords.
System 2: For the automatic adhoc run, 247 stopwords + 226 semi-stopwords.
Semi-stopwords are not used in query expansion unless they also appear in the query.

Based on Porter, with enhancements to deal with peculiar plurals and partial conflation of British/American spellings.

Date ranges are recognized, but no use was made of this.

Fast inversion method operating in limited main memory (peak requirement 40 MB) and limited temporary disk (peak requirement 50 MB more than the f [...]
Both text and index are stored compressed.

All alphanumeric strings are indexed.

Some phrases are recognized by the default Okapi go-list. New phrases were "discovered" from the concepts section of topics 1-100 by treating comma-separated text [...]

nff * nidf; n documents in total, feature f, document d:
  feature frequency: $ff(f,d)$
  normalized feature frequency: $nff(f,d) = ff(f,d) / \max\{ ff(f_i,d) \mid f_i \in d \}$
  document frequency: $df(f) = |\{ d_j \mid f \in d_j \}|$
  inverse document frequency: $idf(f) = \log((n+1)/(df(f)+1))$
  normalized inverse document frequency: $nidf(f) = idf(f)/\log(n+1) = 1 - \log(df(f)+1)/\log(n+1)$
The nidf values are pre-computed from disks 1 & 2 only; new features from disk 3 have an nidf of 1.0.

Mapping each term t to a 32-bit integer by applying two hash functions to the term and by hashing the two resulting numbers into one number.
*-> max 3 terms mapped to the same integer (only in two cases).
*-> only 426 terms (0.1% of all terms) are mapped to ambiguous numbers.
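The nff * nidf weighting defined above can be illustrated with a minimal Python sketch. It follows the formulas as stated (feature frequency normalized by the most frequent feature in the document, and nidf computed as 1 - log(df+1)/log(n+1)). The function names and the list-of-features input format are assumptions made for illustration and do not come from the original system. Note that a feature with df = 0 receives nidf = 1.0 under this formula, which agrees with the value reported for new features.

```python
import math
from collections import Counter


def nidf(df, n):
    """Normalized inverse document frequency: 1 - log(df+1)/log(n+1)."""
    return 1.0 - math.log(df + 1) / math.log(n + 1)


def nff_nidf_weights(docs):
    """Compute nff * nidf for every feature of every document.

    docs: list of documents, each a list of features (hypothetical format).
    Returns one {feature: weight} dict per document.
    """
    n = len(docs)

    # df(f): number of documents that contain feature f at least once.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        ff = Counter(doc)  # ff(f, d): raw feature frequency
        if not ff:
            weights.append({})
            continue
        max_ff = max(ff.values())  # most frequent feature in d
        weights.append(
            {f: (count / max_ff) * nidf(df[f], n) for f, count in ff.items()}
        )
    return weights


if __name__ == "__main__":
    sample = [["a", "a", "b"], ["b", "c"], ["b"]]
    for doc_weights in nff_nidf_weights(sample):
        print(doc_weights)
```

In this toy collection, feature "b" occurs in every document, so its nidf (and therefore its weight) is zero, while a feature found in only one document keeps a positive weight.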
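The term-to-integer mapping described above (two hash functions applied to the term, with the two results hashed into one 32-bit number) might be sketched as follows. The source does not name the hash functions used, so `zlib.crc32` and `zlib.adler32` are stand-ins chosen only because they are in the Python standard library; the combining step and the helper names `term_id` and `ambiguous_terms` are likewise assumptions.

```python
import struct
import zlib
from collections import defaultdict


def term_id(term):
    """Map a term to a 32-bit unsigned integer.

    Assumption: CRC-32 and Adler-32 stand in for the two (unnamed)
    hash functions; their results are packed and hashed once more.
    """
    data = term.encode("utf-8")
    h1 = zlib.crc32(data)
    h2 = zlib.adler32(data)
    combined = struct.pack("<II", h1, h2)  # two 32-bit values, 8 bytes
    return zlib.crc32(combined) & 0xFFFFFFFF


def ambiguous_terms(terms, id_fn=term_id):
    """Return {integer: set of terms} for integers shared by 2+ distinct terms."""
    buckets = defaultdict(set)
    for term in terms:
        buckets[id_fn(term)].add(term)
    return {key: group for key, group in buckets.items() if len(group) > 1}


if __name__ == "__main__":
    vocabulary = ["retrieval", "index", "stopword", "phrase"]
    for term in vocabulary:
        print(term, term_id(term))
    print("ambiguous:", ambiguous_terms(vocabulary))
```

Running `ambiguous_terms` over a full vocabulary gives the kind of collision count reported above; the actual figures depend on the hash functions and on the collection.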
We used no special index data structures for TRW1 (proximity queries).