NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
W. Cavnar
National Institute of Standards and Technology
D. K. Harman
Each line of the query output corresponds to one nationality
or concept string from the topic. The nationality strings are
listed first, followed by the concept strings in the order they
appeared. There is also a rank number associated with each
query string. Generally, as concept strings were used in the
topics, strings that appeared earlier in the list were more
important than strings that appeared later. The actual filter-
ing program also uses these rank numbers to weight the
query strings in importance. Each string receives a weight
equal to 2k - r, where k is the number of query strings for
the topic and r is the string's rank value. Any nationality or
concept string can also have a NOT prefix, which triggers a
slightly different kind of processing for these strings,
described below.
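The weighting rule above can be sketched as follows. This is a minimal illustration, assuming ranks start at 1 in the order the strings are listed (the helper name and the sample topic strings are hypothetical, not from the paper):

```python
def query_weights(query_strings):
    """Weight each query string as 2k - r, where k is the number of
    query strings for the topic and r is the string's 1-based rank."""
    k = len(query_strings)
    return {s: 2 * k - r for r, s in enumerate(query_strings, start=1)}

# Hypothetical topic with three query strings, listed in rank order.
weights = query_weights(["japanese", "trade deficit", "tariff"])
# With k = 3, ranks 1, 2, 3 receive weights 5, 4, 3.
```

Earlier (more important) strings thus always outweigh later ones, and every weight stays positive since r never exceeds k.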
2.2 The Problem of Finding Matches While
Filtering Text
In Figure 1 above, the process labelled "zcat" represents a
Unix utility program that takes a compressed file and
returns its uncompressed form. The process labelled "Find
Matches" represents a program, find_matches, which takes a
set of queries for all of the topics at once, and applies them
to a single uncompressed text file. Given the filter architec-
ture for this system, it was important for us to minimize the
total amount of I/O. By letting find_matches process the
queries for all topics simultaneously, the system has to pass
the data only once.
To make this process as efficient as possible, find_matches
does most of its matching using bit-vectors representing the
presence or absence of particular N-grams. Suppose that we
had a string that we wanted to search for in a document
using N-grams. An obvious way to do it would be to per-
form the following steps:
* Break the query string up into N-grams.
* For every line in the document, count how many of
those N-grams occurred in the line. This count would be
the match score for that line.
* Keep a sorted list of the top-scoring lines, bumping the
lowest scoring entry on this list when a new line has a
better score.
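The naive procedure above can be sketched directly. This is an illustrative version using bi-grams only (the function names are hypothetical); the nested comparison of every query N-gram against every line N-gram is exactly what makes it costly:

```python
import heapq

def ngrams(s, n=2):
    """All contiguous N-grams of a string."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def naive_top_lines(query, lines, n=2, top=5):
    """Nested-loop matching: for each line, count how many of the
    query's N-grams occur in it, and keep the top-scoring lines."""
    qgrams = ngrams(query, n)
    scored = []
    for line in lines:
        lgrams = ngrams(line, n)
        # Every query N-gram is scanned against every line N-gram.
        score = sum(1 for q in qgrams if q in lgrams)
        scored.append((score, line))
    return heapq.nlargest(top, scored)
```

For a query of m N-grams and a line of p N-grams this inner comparison is O(m * p) per line, which motivates the hash-table approach of the next section.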
Unfortunately, this naive implementation of N-gram-based
matching, which uses nested loops to compare every
N-gram in the query string with every N-gram in a line, is
computationally very costly.
2.3 Matching and Scoring Lines
We can do significantly better than the naive N-gram-based
algorithm described above by using hashing. Figure 2 illus-
trates this process for bi-grams. The key idea here is to
break the matching process into two parts:
* Initialize a hash-code-based look-up table for all of the
N-grams in the query string. Each slot simply records
whether any N-gram hashed to that position (1) or not
(0). Thus the hash table is a simple bit vector.
* For each line in the document, make a single pass for all
of the line's N-grams, checking each one to see if it is in
the table. In other words, see if the slot (bit) correspond-
ing to the N-gram contains a 1. If it does, add 1 to the
line's score.
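The two-step scheme can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a bytearray of 0/1 flags in place of a packed bit vector, Python's built-in string hash in place of whatever hash function find_matches uses, and bi-grams only:

```python
TABLE_SIZE = 20000  # slots; large relative to the ~52,022 possible grams

def build_table(query, n=2):
    """Step 1: mark the hash slot of every N-gram in the query string."""
    table = bytearray(TABLE_SIZE)
    for i in range(len(query) - n + 1):
        table[hash(query[i:i + n]) % TABLE_SIZE] = 1
    return table

def score_line(line, table, n=2):
    """Step 2: a single pass over the line's N-grams, adding 1 to the
    score whenever an N-gram's slot contains a 1."""
    return sum(table[hash(line[i:i + n]) % TABLE_SIZE]
               for i in range(len(line) - n + 1))
```

Scoring is now O(p) per line regardless of the query length, since each line N-gram costs one hash and one table look-up.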
If we make the hash table big enough, we can make the
probability of collisions arbitrarily small. For example, con-
sider that there are only 52,022 possible bi-grams and tri-
grams for the alphabet consisting of A-Z, 0-9, and space. A
hash table size of, say, 20,000 slots (625 32-bit words or
2500 bytes) provides far more than ample space to avoid
collisions. If there is a collision between the hash of an
N-gram in the query string and that of some different
N-gram in the line, its only effect is to cause that line's score
to be one higher than it should be. It is even more unlikely
that two or more of the N-grams would also collide. Thus by
allowing collisions in the hash table, we get a simple, quick
algorithm at the cost of a small probability of small distor-
tions in the line scores. See [2] for a further discussion of
how to estimate the probability of these kinds of collisions.
Note that we include leading and trailing spaces for words
in counting N-grams. Also recall that in our full implemen-
tation of this we process tri-grams the same way and at the
same time. The match score between two strings includes
the count of both matching bi-grams and matching tri-
grams.
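The space-padding convention can be illustrated concretely. A minimal sketch, assuming one leading and one trailing space per word (the helper name is hypothetical):

```python
def padded_ngrams(word, n):
    """N-grams of a word counted with a leading and trailing space."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

bigrams = padded_ngrams("cat", 2)   # [' c', 'ca', 'at', 't ']
trigrams = padded_ngrams("cat", 3)  # [' ca', 'cat', 'at ']
```

The padding lets word-boundary N-grams such as ' c' and 't ' contribute to the score, so matches at the start and end of words count alongside interior ones.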
When this algorithm finishes, we will have a list of the top
scoring matches for our query string. We could make this
process a little more sophisticated by putting a threshold of
some sort on it. If a line's match score was below this
threshold, say 80% of the number of N-grams in the query
string, we could ignore this line as simply not containing
enough of the query string to consider.
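Such a threshold test is straightforward. A sketch using bi-grams only and the 80% figure mentioned above as the cut-off (the text gives it only as an example value):

```python
def passes_threshold(score, query, n=2, fraction=0.8):
    """Keep a line only if its match score reaches the given fraction
    of the number of N-grams in the query string."""
    n_query_grams = len(query) - n + 1
    return score >= fraction * n_query_grams
```

A query of five characters yields four bi-grams, so at an 80% threshold a line must match at least 3.2 of them, i.e. all four in practice.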
To get an additional gain in efficiency, we can carry this
hash table scheme for counting N-grams another step by
using parallelism to match a single line from a document
against a number of query strings all at the same time. Fig-
ure 3 illustrates this scheme. Again, matching is a two-step
process: