SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
chapter
W. Cavnar
National Institute of Standards and Technology
D. K. Harman
½4RIN;\\
S, ST, TR, RI, IN, NG, G_
1 1 1 1 ii 1 1 I
Boxes indicate matches
between N-grams from
the two strings. The
match score for these
two strings is 3.
Bi-grams
Hash
Table
Bit
Vector
[OCRerr]S,SP,PR,RP,UN,NG,G_ Bi-grams
SPRUNG
Figure 2: Using Bi-Grams to Compare "STRING" and "SPRUNG"
* Initialize a hash table for the entire set of query strings.
Each slot in the hash table again represents one N-gram,
but it is now a bit vector in which the kth bit indicates
whether the N-gram occurs in the kth query string.
* For every line in the document, hash each N-gram to
find the corresponding bit-vector. Then add that bit vec-
tor to the current score vector for the document. The kth
element of this score vector keeps the current match
score for the document against the kth query string.
One other significant aspect of this multi-query-string
implementation is that we also represent the final score vec-
tor for all of the queries as a set of parallel bit-vectors. This
effectively allows us to perform 32 additions in parallel by
adding one 32-bit word of an N-gram bit-vector from the
table to one 32-bit word representing the 1-bits of 32 query
string scores, and then propagate the carry bits up to bit-vec-
tors representing the higher order bits of the scores. Effec-
tively, this lets us compute scores for 32 query strings in
close to the same amount of time as it takes to compute the
score for a single query string.
175
24 Scoring and Ranking Documents
After the system computes the N-gram match scores of a
single line against all of the query strings, it applies a
threshold criterion. If the score for a particular query string
is below its threshold, the system treats it as a zero score.
The threshold is defined as a percentage of the maximum
possible N-gram score for the query string. For example, if
a particular query string has a maximum possible score of
20 matching N-grams, and the tbxeshold percentage (as
determined by a command line argument to the program) is
70%, then the system will ignore any line whose score is
less than 14. There is a similar command line argument
which determines the threshold percentage for negation
query strings (those tagged with a NOT prefix in the query
string set). If a document has a line whose score for a nega-
tion query string exceeds its threshold, the system will dis-
card the document. For most runs of the system, we got the
best results with a match threshold at 70% and a negation
threshold at 95%. In other words, we were willing to accept
a line as a match for a given query string if it had 70% of the