N-Gram-Based Text Filtering For TREC-2
William B. Cavnar
Environmental Research Institute of Michigan
P.O. Box 134001
Ann Arbor, MI 48113-4001
cavnar@erim.org
Abstract
Most text retrieval and filtering systems depend heavily on the accuracy of the text they process. In other words, the various mechanisms that they use depend on every word in the text and in the queries being correctly and completely spelled. To get around this limitation, our experimental text filtering system uses N-gram-based matching for document retrieval and routing tasks. The system's first application was the TREC-2 retrieval and routing task. Its performance on this task was promising, pointing the way toward several types of enhancements, both for speed and for effectiveness.
1.0 Background
Even though modern character recognition techniques are steadily improving, scanning and recognizing characters in paper documents still results in many errors. These errors can impair the ability of a standard, classical text retrieval system to process the resulting ASCII text by interfering with word stemming, stopword identification, or even basic indexing. Furthermore, as paper documents age, their print quality can degrade, possibly causing further recognition problems. This is an increasingly serious problem, given the vast amount of material that exists only on paper with a limited lifetime.
Another difficulty is that the texts themselves may contain
spelling variations (e.g., British vs. American) or outright
spelling errors. The same is true of the queries entered by a
human user. Discrepancies between the spelling of a word
in a document and a word in a query may prevent a match.
A third problem area is that current word stemming and stopword list construction algorithms sometimes require a significant amount of knowledge about the documents' language. Although there are automatic approaches to acquiring this kind of knowledge, they are complicated [1]. An ideal text retrieval method would require no special language knowledge, and in fact would be able to handle multiple languages (represented, say, in Unicode) simultaneously.
All of these problems have analogs in a somewhat different problem domain, namely, reading and interpreting postal addresses on pieces of mail. For the last eight years, the Environmental Research Institute of Michigan (ERIM) has conducted research for the United States Postal Service (USPS) in automatic address interpretation. The ultimate goal of this research is to build a high-performance address interpretation unit that can reliably and efficiently read the address information on a mailpiece and determine its correct 9-digit or 11-digit ZIP code, even if that code is not present on the mailpiece or is in error.
There are two parts to interpreting an address from a mailpiece, and both of them are difficult:
* Recognizing the characters in the address block. This is difficult because of poor print quality, poor contrast, complicated backgrounds, presence of logos and non-address information, use of proportional fonts, and a host of other problems.
* Interpreting the lines of the address. This is difficult because of spelling errors, addressing errors, unusual abbreviations, strange local addressing conventions, and errors and omissions in the USPS databases.
Our current address interpretation systems use a number of
different techniques to get around these problems. Among
these is the use of N-gram-based matching to find matches
in postal databases. (Please see [2] and [3] for details of
these systems.)
In this work, N-grams are N-character slices of some longer string. We do not use positional N-grams; i.e., what is important is whether a given string contains a particular N-gram, not where it falls in the string. We use bi-grams (N=2) and tri-grams (N=3) together in the same system. Bi-grams and tri-grams complement each other in the sense that bi-grams provide somewhat better matching for individual words, while tri-grams provide the connections between words to improve phrase matching.
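To make this concrete, the following minimal Python sketch (illustrative only; the function name and example strings are not drawn from our actual system) extracts the set of non-positional N-grams from a string and shows why bi-grams tolerate an isolated character error while tri-grams bridge word boundaries:

# Illustrative sketch, not the system's actual implementation.
def ngrams(text, n):
    # Return the set of N-character slices of text, ignoring position.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Bi-grams (N=2): most slices of a word survive a single recognition error.
clean = ngrams("retrieval", 2)
noisy = ngrams("retr1eval", 2)   # '1' misrecognized in place of 'i'
print(len(clean & noisy), "of", len(clean), "bi-grams still match")

# Tri-grams (N=3): slices span the word boundary, linking adjacent words.
print(sorted(ngrams("text filtering", 3)))

A matcher built on such sets can, for example, rank candidate strings by the number of N-grams they share with a noisy query string, which is what makes this style of matching robust to recognition and spelling errors.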