SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) N-Gram-Based Text Filtering For TREC-2 chapter W. Cavnar National Institute of Standards and Technology D. K. Harman ½4RIN;\\ S, ST, TR, RI, IN, NG, G_ 1 1 1 1 ii 1 1 I Boxes indicate matches between N-grams from the two strings. The match score for these two strings is 3. Bi-grams Hash Table Bit Vector [OCRerr]S,SP,PR,RP,UN,NG,G_ Bi-grams SPRUNG Figure 2: Using Bi-Grams to Compare "STRING" and "SPRUNG" * Initialize a hash table for the entire set of query strings. Each slot in the hash table again represents one N-gram, but it is now a bit vector in which the kth bit indicates whether the N-gram occurs in the kth query string. * For every line in the document, hash each N-gram to find the corresponding bit-vector. Then add that bit vec- tor to the current score vector for the document. The kth element of this score vector keeps the current match score for the document against the kth query string. One other significant aspect of this multi-query-string implementation is that we also represent the final score vec- tor for all of the queries as a set of parallel bit-vectors. This effectively allows us to perform 32 additions in parallel by adding one 32-bit word of an N-gram bit-vector from the table to one 32-bit word representing the 1-bits of 32 query string scores, and then propagate the carry bits up to bit-vec- tors representing the higher order bits of the scores. Effec- tively, this lets us compute scores for 32 query strings in close to the same amount of time as it takes to compute the score for a single query string. 175 24 Scoring and Ranking Documents After the system computes the N-gram match scores of a single line against all of the query strings, it applies a threshold criterion. If the score for a particular query string is below its threshold, the system treats it as a zero score. The threshold is defined as a percentage of the maximum possible N-gram score for the query string. For example, if a particular query string has a maximum possible score of 20 matching N-grams, and the tbxeshold percentage (as determined by a command line argument to the program) is 70%, then the system will ignore any line whose score is less than 14. There is a similar command line argument which determines the threshold percentage for negation query strings (those tagged with a NOT prefix in the query string set). If a document has a line whose score for a nega- tion query string exceeds its threshold, the system will dis- card the document. For most runs of the system, we got the best results with a match threshold at 70% and a negation threshold at 95%. In other words, we were willing to accept a line as a match for a given query string if it had 70% of the