NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
W. Cavnar
National Institute of Standards and Technology, D. K. Harman

Each line of the query output corresponds to one nationality or concept string from the topic. The nationality strings are listed first, followed by the concept strings in the order they appeared. Each query string also has an associated rank number. Generally, as concept strings were used in the topics, strings that appeared earlier in the list were more important than strings that appeared later. The actual filtering program also uses these rank numbers to weight the query strings by importance: each string receives a weight equal to 2k - r, where k is the number of query strings for the topic and r is the string's rank value. Any nationality or concept string can also have a NOT prefix, which triggers a slightly different kind of processing for these strings, described below.

2.2 The Problem of Finding Matches While Filtering Text

In Figure 1 above, the process labelled "zcat" represents a Unix utility program that takes a compressed file and returns its uncompressed form. The process labelled "Find Matches" represents a program, find_matches, which takes a set of queries for all of the topics at once and applies them to a single uncompressed text file. Given the filter architecture for this system, it was important for us to minimize the total amount of I/O. By letting find_matches process the queries for all topics simultaneously, the system has to pass over the data only once.

To make this process as efficient as possible, find_matches does most of its matching using bit vectors representing the presence or absence of particular N-grams. Suppose that we had a string that we wanted to search for in a document using N-grams. An obvious way to do it would be to perform the following steps:

* Break the query string up into N-grams.
* For every line in the document, count how many of those N-grams occur in the line. This count is the match score for that line.
* Keep a sorted list of the top-scoring lines, bumping the lowest-scoring entry from this list when a new line has a better score.

Unfortunately, this naive implementation of N-gram-based matching, which uses nested loops to compare every N-gram in the query string with every N-gram in a line, is computationally very costly.

2.3 Matching and Scoring Lines

We can do significantly better than the naive N-gram-based algorithm described above by using hashing. Figure 2 illustrates this process for bi-grams. The key idea here is to break the matching process into two parts:

* Initialize a hash-code-based look-up table for all of the N-grams in the query string. Each slot simply records whether any N-gram hashed to that position (1) or not (0). Thus the hash table is a simple bit vector.
* For each line in the document, make a single pass over all of the line's N-grams, checking each one to see if it is in the table. In other words, see if the slot (bit) corresponding to the N-gram contains a 1. If it does, add 1 to the line's score.

If we make the hash table big enough, we can make the probability of collisions arbitrarily small. For example, consider that there are only 52,022 possible bi-grams and tri-grams for the alphabet consisting of A-Z, 0-9, and space. A hash table size of, say, 20,000 slots (625 32-bit words, or 2500 bytes) provides far more than ample space to avoid collisions. If there is a collision between the hash of an N-gram in the query string and that of some different N-gram in the line, its only effect is to make that line's score one higher than it should be. It is even more unlikely that two or more of the N-grams would also collide. Thus, by allowing collisions in the hash table, we get a simple, quick algorithm at the cost of a small probability of small distortions in the line scores.
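The two-part scheme above can be sketched as follows. This is only an illustration, not the actual find_matches code: the original program was presumably written in C and used its own hash function over bi-grams and tri-grams simultaneously, whereas this sketch handles a single n-gram size and borrows Python's built-in hash() as a stand-in. The table size of 20,000 bits follows the figure quoted in the text.

```python
def ngrams(text, n):
    """Return all character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

TABLE_BITS = 20000  # hash-table size suggested in the text (2500 bytes)

def build_bit_vector(query, n=2):
    """Step 1: set one bit for each n-gram of the query string."""
    bits = bytearray(TABLE_BITS // 8)
    for g in ngrams(query, n):
        h = hash(g) % TABLE_BITS
        bits[h // 8] |= 1 << (h % 8)
    return bits

def score_line(line, bits, n=2):
    """Step 2: in a single pass over the line's n-grams, count those
    whose corresponding bit is set in the table."""
    score = 0
    for g in ngrams(line, n):
        h = hash(g) % TABLE_BITS
        if bits[h // 8] & (1 << (h % 8)):
            score += 1
    return score
```

Note that a hash collision here can only inflate a line's score by one per colliding n-gram, which matches the paper's observation that collisions cause small upward distortions rather than missed matches.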
See [2] for a further discussion of how to estimate the probability of these kinds of collisions.

Note that we include leading and trailing spaces for words in counting N-grams. Also recall that our full implementation processes tri-grams the same way, and at the same time; the match score between two strings includes the count of both matching bi-grams and matching tri-grams.

When this algorithm finishes, we will have a list of the top-scoring matches for our query string. We could make this process a little more sophisticated by putting a threshold of some sort on it: if a line's match score was below this threshold, say 80% of the number of N-grams in the query string, we could ignore the line as simply not containing enough of the query string to consider.

To get an additional gain in efficiency, we can carry this hash-table scheme for counting N-grams another step by using parallelism to match a single line from a document against a number of query strings all at the same time. Figure 3 illustrates this scheme. Again, matching is a two-step process: