SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) N-Gram-Based Text Filtering For TREC-2 chapter W. Cavnar National Institute of Standards and Technology D. K. Harman

match between two words with different capitalizations as a good match, but a match between two words with the same capitalizations as a better one.

* Our system is basically just a text filter. As such, it is already close to being useful for various routing tasks. However, to use N-gram-based matching for on-line retrieval, we will have to implement a true index-based system. Ideally, we would like to integrate the text retrieval capability of our zview system with the flexibility of the multi-query filtering system described here. Although we have an initial design sketched out, it will take an investment of further time and machine resources to implement this idea and test it. We will also be using some of our new compact N-gram representation techniques to reduce the large amount of index storage and computation required.

Acknowledgments

We would like to thank our sponsors at DARPA/SISTO for allowing us the opportunity to participate in TREC-2. We would also like to thank Donna Harman and her staff at NIST for all of their help. Finally, we are grateful for the United States Postal Service's sponsorship of the original N-gram-based matching technology that inspired our TREC research.

References

[1] W. B. Frakes. Stemming Algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 131-160. Prentice Hall, Inc., Englewood Cliffs, NJ, 1992.

[2] William B. Cavnar and Alan J. Vayda. Using superimposed coding of N-gram lists for efficient inexact matching. In Proceedings of the Fifth USPS Advanced Technology Conference, pages 253-267, Washington, DC, 1992.

[3] William B. Cavnar and Alan J. Vayda.
N-gram-based matching for multi-field database access in postal applications. In Proceedings of the 1993 Symposium On Document Analysis and Information Retrieval, pages 287-297, University of Nevada, Las Vegas.

[4] Mark Zimmerman. Proximity-correlation for document ranking: The PARA Group's TREC Experiment. In Proceedings of the First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992.