SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
chapter
W. Cavnar
National Institute of Standards and Technology
D. K. Harman
N-gram-based matching addresses the difficulties men-
tioned above for postal addresses in ways that have applica-
tion for document retrieval:
* It uSually finds good matches for postal addresses even
in the face of a significant number of individual charac-
ter recognition errors. This is particularly true if the sys-
tem can use redundant data for the search key, such as
the city, state, and ZIP together. This is identical to the
problem of matching document text that was scanned
from poor quality images. Words that occur near one
another in a text also "reinforce" each other for search-
ing the way that city, state, and ZIP do.
* It easily overcomes problems with spelling and address-
ing variations and errors on mailpieces or in postal data-
bases. This is analogous to the problem of matching
document text that contains spelling variations. It also
achieves an effect similar to stemming in that different
words having the same root are considered to be similar.
* It can also handle abbreviations fairly easily. This is also
similar to the problem of word stemming.
* Some postal address interpretation systems depend
heavily on the system's ability to recognize certain key
words, such as SThEET or BOX. These words play a
role similar to stopwords in that they carry little content,
and mostly provide structural information. N-gram-
based matching can handle errors even in these words.
Drawing on the experience of using N-grams in postal
work, ERIM has been adapting these techniques for use in
full-text retrieval and filtering. We currently have a working
full-text retrieval system, called zview, which provides a
simple query facility for accessing a wide variety of ASCII
documents, including biographies, abstracts, and documen-
tation. The zview system's N-gram-based matching provides
a very simple, easy-to-use text retrieval system. It is tolerant
of spelling variations and errors in both the text and the doc-
uments. It also provides a straighiforward ranking of docu-
ments that enables the user to see which documents best
match the query. This system has given us some confidence
in the wider utility of N-gram-based matching for informa-
tion retrieval. Although the ThEC task would not seem to
direcfly require inexact matching (the documents were high
quality ASCII), our experience with zyjew indicates that this
kind of tolerance for variations in user input is very useful
for general retrieval.
Incidentally, one criticism that one can make of N-gram-
based systems is that they require large amounts of storage.
Although this is true of the naive implementation, there are
a number of variations using various superimposed coding
172
techniques that provide substantial space savings without
impacting matching performance. See [2] for details.
2.0 System Description
The zyjew system mentioned above has two parts: an off-
line index builder, and an on-line retrieval interface. In its
current form, zyjew does not yet use our most compact
N-gram representation techniques, and is thus not well-
suited for very large (i.e., gigabyte size) databases. Also, it
can handle only single query strings, not the multiple query
strings necessary for TREC. Instead, for this task we opted
to use a simple filter architecture, and concentrated on mak-
mg it as fast as possible. This architecture has several
advantages:
* It requires no disk space for an index. In fact, we were
able to save even more disk space by leaving the original
documents in their compressed form, and uncompress-
ing them on the fly.
* It is conceptually simple, requiring only a very modest
amount of code (987 lines of C source). The whole sys-
tem took only a few days to design and build.
This architecture is similar in concept to that used by Mark
Zimmerman's PARA system, which participated in last
year's TREC. That system used a set of handcrafted Unix
awk scripts to filter the documents. The PARA system,
although it got good results, took 22 days on a more or less
dedicated NeXT machine, even though its match criteria
generally specified fewer match terms than our system did.
Also Zimmerman's system used the notion of proximlty to
identify sets of matching lines that occurred near one
another, whereas our system counts matches throughout the
document without regard to the proximity of match terms.
See [4) for details.
Obviously, a significant disadvantage of such a filter-type
architecture is that the system has to read all of the data to
simulate a retrieval. This is very computationally intensive,
but we were able to mitigate this somewhat in our system by
handling much of the actual N-gram-based matching with a
highly parallel bit-vector approach, which we describe
below. Also, we were able to split the processing load up
among a suite of computers and run them all in a coarse-
grained parallel fashion.
Figure 1 gives an overview of our system in dataflow fash-
ion. We discuss each of the processes in this diagram in the
following sections.
-5