SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) N-Gram-Based Text Filtering For TREC-2 chapter W. Cavnar National Institute of Standards and Technology D. K. Harman N-gram-based matching addresses the difficulties men- tioned above for postal addresses in ways that have applica- tion for document retrieval: * It uSually finds good matches for postal addresses even in the face of a significant number of individual charac- ter recognition errors. This is particularly true if the sys- tem can use redundant data for the search key, such as the city, state, and ZIP together. This is identical to the problem of matching document text that was scanned from poor quality images. Words that occur near one another in a text also "reinforce" each other for search- ing the way that city, state, and ZIP do. * It easily overcomes problems with spelling and address- ing variations and errors on mailpieces or in postal data- bases. This is analogous to the problem of matching document text that contains spelling variations. It also achieves an effect similar to stemming in that different words having the same root are considered to be similar. * It can also handle abbreviations fairly easily. This is also similar to the problem of word stemming. * Some postal address interpretation systems depend heavily on the system's ability to recognize certain key words, such as SThEET or BOX. These words play a role similar to stopwords in that they carry little content, and mostly provide structural information. N-gram- based matching can handle errors even in these words. Drawing on the experience of using N-grams in postal work, ERIM has been adapting these techniques for use in full-text retrieval and filtering. We currently have a working full-text retrieval system, called zview, which provides a simple query facility for accessing a wide variety of ASCII documents, including biographies, abstracts, and documen- tation. The zview system's N-gram-based matching provides a very simple, easy-to-use text retrieval system. It is tolerant of spelling variations and errors in both the text and the doc- uments. It also provides a straighiforward ranking of docu- ments that enables the user to see which documents best match the query. This system has given us some confidence in the wider utility of N-gram-based matching for informa- tion retrieval. Although the ThEC task would not seem to direcfly require inexact matching (the documents were high quality ASCII), our experience with zyjew indicates that this kind of tolerance for variations in user input is very useful for general retrieval. Incidentally, one criticism that one can make of N-gram- based systems is that they require large amounts of storage. Although this is true of the naive implementation, there are a number of variations using various superimposed coding 172 techniques that provide substantial space savings without impacting matching performance. See [2] for details. 2.0 System Description The zyjew system mentioned above has two parts: an off- line index builder, and an on-line retrieval interface. In its current form, zyjew does not yet use our most compact N-gram representation techniques, and is thus not well- suited for very large (i.e., gigabyte size) databases. Also, it can handle only single query strings, not the multiple query strings necessary for TREC. Instead, for this task we opted to use a simple filter architecture, and concentrated on mak- mg it as fast as possible. This architecture has several advantages: * It requires no disk space for an index. In fact, we were able to save even more disk space by leaving the original documents in their compressed form, and uncompress- ing them on the fly. * It is conceptually simple, requiring only a very modest amount of code (987 lines of C source). The whole sys- tem took only a few days to design and build. This architecture is similar in concept to that used by Mark Zimmerman's PARA system, which participated in last year's TREC. That system used a set of handcrafted Unix awk scripts to filter the documents. The PARA system, although it got good results, took 22 days on a more or less dedicated NeXT machine, even though its match criteria generally specified fewer match terms than our system did. Also Zimmerman's system used the notion of proximlty to identify sets of matching lines that occurred near one another, whereas our system counts matches throughout the document without regard to the proximity of match terms. See [4) for details. Obviously, a significant disadvantage of such a filter-type architecture is that the system has to read all of the data to simulate a retrieval. This is very computationally intensive, but we were able to mitigate this somewhat in our system by handling much of the actual N-gram-based matching with a highly parallel bit-vector approach, which we describe below. Also, we were able to split the processing load up among a suite of computers and run them all in a coarse- grained parallel fashion. Figure 1 gives an overview of our system in dataflow fash- ion. We discuss each of the processes in this diagram in the following sections. -5