NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
W. Cavnar
D. K. Harman (ed.), National Institute of Standards and Technology
Document "expansion of Medicare to cCP+v catastrophic illnesses, faster"
Line 7
hash
L;½2on
Hashed N-gram
Indices
Query String
Indices
*
A
Accumulated
parallel score
bit-vectors for
hashed N-grams
over all query
strings, up to
and including
"ve" bi-gram
8 42 1
Bit-vector for
kth query string
[OCRerr]______ ______ I
I
Bits for
score of Ii
kth query
string
Add selected
Bit-vector for
presence of a bit-vector to
particular N-gram l's score vector,
and propagate
in all query strings
carry bits up
Figure 3: Parallel Implementation for Multiple Query Strings
requisite N-grams, but needed a nearly exact match in order
to discard it for matching a negation query.
To compute the scores for a whole document, we accumulated the line scores for each query string, but applied a cap at twice the maximum possible N-gram score. In other words, if a document mentions a particular query string twice, that is twice as good as mentioning it once; however, mentioning it three or more times is of no additional value. We put this cap into the system after observing that initial versions of the system would overrate some documents because they mentioned some low-ranking query string for a given topic many times, but completely missed any of the higher-ranking query strings for the topic.
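As a minimal sketch of this capping rule (the function and variable names below are ours, not the original ERIM implementation, and we assume per-line scores for each query string are already available):

    # Illustrative Python sketch of the score cap described above.
    def capped_query_string_score(line_scores, max_ngram_score):
        # Accumulate the per-line scores for one query string, but cap
        # the total at twice the maximum possible N-gram score, so a
        # second mention helps and a third or later one adds nothing.
        return min(sum(line_scores), 2 * max_ngram_score)

    # Example: with a maximum per-line score of 10, three full matches
    # score 20, the same as two full matches.
    assert capped_query_string_score([10, 10, 10], 10) == 20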
To generate the final scores for a document against a topic, the system accumulates the scores for each query string in the topic, multiplying them by the weighting factor computed earlier. If this aggregate weighted score exceeds a hard-coded threshold of 40, the system writes out this score as a relevance judgement for the document. Later, a separate program reads the file containing the relevance scores, computes the top 1000 scores for each topic, and writes them out as the final system result.
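The sketch below is a rough illustration of this final stage; the names are our own, and the only values taken from the text are the hard-coded threshold of 40 and the cutoff of 1000 documents per topic:

    # Hypothetical sketch of the final scoring and selection steps.
    SCORE_THRESHOLD = 40  # hard-coded threshold from the text

    def weighted_topic_score(query_string_scores, weights):
        # Sum each query string's capped score times the weighting
        # factor computed earlier for that query string.
        return sum(w * s for w, s in zip(weights, query_string_scores))

    def select_top_documents(passing_docs, limit=1000):
        # passing_docs: (doc_id, score) pairs whose weighted score
        # exceeded SCORE_THRESHOLD; keep the top `limit` per topic.
        return sorted(passing_docs, key=lambda p: p[1], reverse=True)[:limit]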
3.0 Multi-Processor Execution
Even with all of this attention to efficient implementation,
the system still required a large amount of computation.
Fortunately, ERIM had available a network of Sun worksta-
tions over which we could distribute the processing load.
This network consisted of two Sun SPARCstations and five
or six Sun SPARCstation 2 machines. We used a very sim-
ple partitioning scheme, assigning each machine a fixed set
of documents to evaluate against all of the topics. Given this
arrangement, a full run of topic set 3 against Disks 1 and 2
took 12 to 13 hours if all of the machines were fully avail-
able. When there was significant contention on some of the
machines, the run might take as much as 24 hours. An obvious drawback of the fixed partitioning scheme was that one or more of the machines would often not finish until many hours after all of the others had, usually because of unexpectedly heavy contention on those machines. If we had used a more intelligent and dynamic
document file assignment mechanism (implemented in, say,
the Linda coordination language), we would have had better
elapsed execution times.
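For comparison, a dynamic assignment of this kind could be as simple as a shared work queue that idle machines pull document files from. The sketch below is purely illustrative, written in Python rather than the Linda coordination language mentioned above, and nothing like it was part of the system we actually ran:

    import queue
    import threading

    def evaluate_against_all_topics(doc_file):
        # Stand-in for the real per-file scoring pass (hypothetical).
        return len(doc_file)

    def dynamic_dispatch(doc_files, num_workers):
        # Rather than a fixed partition per machine, workers repeatedly
        # take the next unprocessed document file from a shared queue,
        # so lightly loaded workers naturally absorb more of the work.
        tasks = queue.Queue()
        for doc_file in doc_files:
            tasks.put(doc_file)
        results, lock = [], threading.Lock()

        def worker():
            while True:
                try:
                    doc_file = tasks.get_nowait()
                except queue.Empty:
                    return
                score = evaluate_against_all_topics(doc_file)
                with lock:
                    results.append((doc_file, score))

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results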