SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
chapter
W. Cavnar
National Institute of Standards and Technology
D. K. Harman
match between two words with different capitalizations
as a good match, but a match between two words with
the same capitalizations as a better one.
* Our system is basically just a text filter. As such, it is
already close to being useful for various routing tasks.
However, to use N-gram-based matching for on-line
retrieval, we will have to implement a true index-based
system. Ideally, we would like to integrate the text
retrieval capability of our zview system with the flexibil-
ity of the multi-query filtering system described here.
Although we have an initial design sketched out, it will
take an investment of further time and machine
resources to implement this idea and test it. We will also
be using some of our new compact N-gram representa-
tion techniques to reduce the large amount of index stor-
age and computation required.
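To make the contrast between filtering and index-based retrieval concrete, the following is a minimal sketch of an inverted index keyed on character N-grams. It is not the design referred to above: the function names (ngrams, build_index, search), the use of trigrams, the blank padding around each word, and the simple overlap score are all assumptions made for illustration.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character N-grams of each word, padded with blanks."""
    grams = set()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def build_index(docs, n=3):
    """Map each N-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=3):
    """Rank documents by the fraction of query N-grams they share."""
    query_grams = ngrams(query, n)
    counts = defaultdict(int)
    for gram in query_grams:
        for doc_id in index.get(gram, ()):
            counts[doc_id] += 1
    # Only documents sharing at least one N-gram are ever touched,
    # which is the point of indexing rather than filtering every text.
    return sorted(
        ((count / len(query_grams), doc_id) for doc_id, count in counts.items()),
        reverse=True,
    )
```

A filter compares every incoming document against every query, whereas a lookup of this kind visits only the postings for the query's N-grams. Because matching is on N-grams rather than whole words, a misspelled query such as "retreival" would still share several trigrams with a document containing "retrieval". The cost is the large index that the compact representations mentioned above are meant to reduce.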
Acknowledgments
We would like to thank our sponsors at DARPA/SISTO for
allowing us the opportunity to participate in TREC-2. We
would also like to thank Donna Harman and her staff at
NIST for all of their help. Finally, we are grateful for the
United States Postal Service's sponsorship of the original
N-gram-based matching technology that inspired our TREC
research.
References
[1] W. B. Frakes. Stemming Algorithms. In William B.
Frakes and Ricardo Baeza-Yates, Editors, Information
Retrieval: Data Structures & Algorithms, pages 131-
160. Prentice Hall, Inc. Englewood Cliffs, NJ, 1992.
[2] William B. Cavnar and Alan J. Vayda. Using superimposed
coding of N-gram lists for efficient inexact
matching. In Proceedings of the Fifth USPS Advanced
Technology Conference, pages 253-267, Washington,
DC, 1992.
[3] William B. Cavnar and Alan J. Vayda. N-gram-based
matching for multi-field database access in postal
applications. In Proceedings of the 1993 Symposium On
Document Analysis and Information Retrieval, pages
287-297, University of Nevada, Las Vegas.
[4] Mark Zimmerman. Proximity-correlation for document
ranking: The PARA Group's TREC Experiment. In
Proceedings of the First Text REtrieval Conference
(TREC-1), NIST Special Publication 500-207, 1992.