N-Gram-Based Text Filtering For TREC-2

William B. Cavnar
Environmental Research Institute of Michigan
P.O. Box 134001
Ann Arbor, MI 48113-4001
cavnar@erim.org

Abstract

Most text retrieval and filtering systems depend heavily on the accuracy of the text they process. In other words, the various mechanisms that they use depend on every word in the text and in the queries being correctly and completely spelled. To get around this limitation, our experimental text filtering system uses N-gram-based matching for document retrieval and routing tasks. The system's first application was for the TREC-2 retrieval and routing task. Its performance on this task was promising, pointing the way for several types of enhancements, both for speed and effectiveness.

1.0 Background

Even though modern character recognition techniques are steadily improving, scanning and recognizing characters in paper documents still results in many errors. These errors can impact the ability of a standard, classical text retrieval system to process the resulting ASCII text by interfering with word stemming, stopword identification, or even basic indexing. Furthermore, as paper documents age, their print quality can degrade, possibly causing further recognition problems. This is becoming a serious problem, especially given the vast amount of material that exists in a paper form that has a limited lifetime.

Another difficulty is that the texts themselves may contain spelling variations (e.g., British vs. American) or outright spelling errors. The same is true of the queries entered by a human user. Discrepancies between the spelling of a word in a document and a word in a query may prevent a match.

A third problem area is that current word stemming and stopword list construction algorithms sometimes involve a significant amount of knowledge about the documents' language. Although there are automatic approaches to acquiring this kind of knowledge, it is complicated [1]. An ideal text retrieval method would require no special language knowledge, and in fact would be able to handle multiple languages (represented, say, in Unicode) simultaneously.

All of these problems have analogs in a somewhat different problem domain, namely, reading and interpreting postal addresses on pieces of mail. For the last eight years, the Environmental Research Institute of Michigan (ERIM) has conducted research for the United States Postal Service (USPS) in automatic address interpretation. The ultimate goal of this research is to build a high-performance address interpretation unit that can reliably and efficiently read the address information on a mailpiece and determine its correct 9-digit or 11-digit ZIP code, even if that code is not present on the mailpiece or is in error.

There are two parts to interpreting an address from a mailpiece, and both of them are difficult:

* Recognizing the characters in the address block. This is difficult because of poor print quality, poor contrast, complicated backgrounds, presence of logos and non-address information, use of proportional fonts, and a host of other problems.

* Interpreting the lines of the address.
This is difficult because of spelling errors, addressing errors, unusual abbreviations, strange local addressing conventions, and errors and omissions in the USPS databases.

Our current address interpretation systems use a number of different techniques to get around these problems. Among these is the use of N-gram-based matching to find matches in postal databases. (Please see [2] and [3] for details of these systems.) In this work, N-grams are N-character slices of some longer string. We do not use positional N-grams; i.e., what is important is whether a given string contains a particular N-gram, not where it falls in the string. We use bi-grams (N=2) and tri-grams (N=3) together in the same system. Bi-grams and tri-grams complement each other in the sense that bi-grams provide somewhat better matching for individual words, while tri-grams provide the connections between words to improve phrase matching.
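To make the N-gram definition concrete, the following minimal Python sketch shows non-positional bi-gram and tri-gram extraction from a string, and a simple overlap score between two strings. It is an illustration of the general idea only: the function names and the overlap_score measure are hypothetical and are not the matching or scoring method used in the ERIM postal or TREC-2 systems.

def ngrams(text, n):
    """Return the set of character N-grams of length n in text.

    Using a set reflects the non-positional treatment described above:
    we record only whether a string contains a given N-gram, not where
    the N-gram occurs.
    """
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def bi_and_tri_grams(text):
    """Combine bi-grams (N=2) and tri-grams (N=3) for one string."""
    return ngrams(text, 2) | ngrams(text, 3)


def overlap_score(query, candidate):
    """Hypothetical similarity: fraction of the query's N-grams that
    also appear in the candidate string."""
    q = bi_and_tri_grams(query)
    c = bi_and_tri_grams(candidate)
    return len(q & c) / len(q) if q else 0.0


if __name__ == "__main__":
    print(sorted(ngrams("main", 2)))              # ['ai', 'in', 'ma']
    print(overlap_score("michigan", "michigan"))  # 1.0
    # A misspelling still shares most of its N-grams with the intended word.
    print(overlap_score("michigan", "michgan"))   # roughly 0.6

Because the strings are sliced whole, tri-grams that span a space (e.g., "n a" in "ann arbor") capture word-to-word transitions, which is the sense in which tri-grams help with phrase matching while bi-grams handle individual words.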