[Figure 3: Parallel Implementation for Multiple Query Strings]

requisite N-grams, but needed a nearly exact match in order to discard it for matching a negation query. To compute the scores for a whole document, we accumulated the line scores for each query string, but applied a cap at twice the maximum possible N-gram score. In other words, if a document mentions a particular query string twice, that is twice as good as mentioning it once. However, mentioning it three or more times is of no additional value. We put this cap into the system after observing that initial versions of the system would overrate some documents because they mentioned some low-ranking query string for a given topic many times, but completely missed any of the higher-ranking query strings for the topic.

To generate the final scores for a document against a topic, the system accumulates the scores for each query string in the topic, multiplying them by the weighting factor computed earlier. If this aggregate weighted score exceeds a hard-coded threshold of 40, the system writes out this score as a relevance judgement for the document. Later, a separate program reads the file containing the relevance scores, computes the top 1000 scores for each topic, and writes them out as the final system result.

3.0 Multi-Processor Execution

Even with all of this attention to efficient implementation, the system still required a large amount of computation. Fortunately, ERIM had a network of Sun workstations available over which we could distribute the processing load. This network consisted of two Sun SPARCstations and five or six Sun SPARCstation 2 machines. We used a very simple partitioning scheme, assigning each machine a fixed set of documents to evaluate against all of the topics. Given this arrangement, a full run of topic set 3 against Disks 1 and 2 took 12 to 13 hours if all of the machines were fully available. When there was significant contention on some of the machines, the run might take as much as 24 hours. An obvious drawback of the fixed partitioning scheme was that often one or more of the machines would not finish until many hours after all of the others had, usually because of unexpected heavy contention on those machines. If we had used a more intelligent, dynamic document-file assignment mechanism (implemented in, say, the Linda coordination language), we would have had better elapsed execution times.
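The capped scoring and thresholding described earlier in this section can be summarized in a rough, self-contained sketch. The paper does not give code, so the function names, the bi-gram stand-in for the hashed N-gram line score, and the (query string, weight) representation of a topic are all our assumptions; only the cap at twice the maximum score and the hard-coded threshold of 40 come from the text.

# Rough sketch (not the original system) of the capped per-query-string
# scoring and the final weighted thresholding described in the text.

THRESHOLD = 40  # hard-coded aggregate weighted score threshold from the paper

def bigrams(text):
    """Stand-in N-gram extractor: the set of character bi-grams in a string."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def line_score(line, query):
    """Stand-in for the hashed N-gram line score: number of query bi-grams
    that also appear in the line."""
    return len(bigrams(query) & bigrams(line))

def score_document(doc_lines, topic):
    """topic is a list of (query_string, weight) pairs, where the weights are
    the weighting factors computed earlier in the paper."""
    total = 0.0
    for query, weight in topic:
        max_score = len(bigrams(query))              # best possible score for one line
        accumulated = sum(line_score(line, query) for line in doc_lines)
        accumulated = min(accumulated, 2 * max_score)  # cap at twice the maximum
        total += weight * accumulated
    # Only documents whose aggregate weighted score exceeds the threshold
    # are written out as relevance judgements.
    return total if total > THRESHOLD else None

As described above, a separate pass would then collect these judgements, keep the top 1000 scores for each topic, and write them out as the final system result.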
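The fixed partitioning scheme is simple enough to sketch as well. The paper only says that each machine received a fixed set of document files to evaluate against all topics; the round-robin assignment and the helper name below are assumptions made for illustration.

# Minimal sketch of a static document-file assignment: each machine gets its
# slice up front, so a slow or heavily contended machine can finish long
# after the others.

def fixed_partition(document_files, machines):
    """Assign each document file to a machine round-robin, once, in advance."""
    assignment = {m: [] for m in machines}
    for i, doc_file in enumerate(document_files):
        assignment[machines[i % len(machines)]].append(doc_file)
    return assignment

# A dynamic alternative (closer in spirit to the Linda-style mechanism the
# authors mention) would instead let idle machines pull the next unprocessed
# file from a shared queue, evening out the elapsed time across machines.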