NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

N-Gram-Based Text Filtering For TREC-2

W. Cavnar

National Institute of Standards and Technology, D. K. Harman

[Figure 1: Dataflow Diagram for N-Gram-Based Text Retrieval. Labels recoverable from the diagram: Topics, Generate Queries, Queries, Compressed Documents, Uncompressed Documents, Relevance Scores, Find Top Choices, Top Choice Scores.]

2.1 Generating Query Strings From TREC Topics

The process labelled "Generate Queries" represents a program, gen_query, which takes the TREC topics file and generates a set of query strings for each topic. We considered several different schemes for extracting query strings from the topics, but finally settled on using just the phrases in the concept and nationality sections. The following is a typical topic from TREC-1:

    <num> Number 007
    <title> Topic: U.S. Budget Deficit
    <desc> Description:
    Document will mention a proposal to decrease the U.S. budget deficit.
    ...
    <con> Concept(s):
    1. U.S. budget deficit, federal budget shortfall
    2. foreign affairs budget, defense budget, entitlements
    3. increased revenues, tax increase, tax reform, auction quota
    4. reduction in expenditures, spending cuts, cutting domestic programs, eliminating government subsidies
    5. NOT financing the U.S. budget deficit
    <nat> Nationality: U.S.

Given this topic, the gen_query program generates the following set of query strings:

    007 000 U.S.
    007 000 U.S. budget deficit
    007 001 federal budget shortfall
    007 002 foreign affairs budget
    007 003 defense budget
    007 004 entitlements
    007 005 increased revenues
    007 006 tax increase
    007 007 tax reform
    007 008 auction quota
    007 009 reduction in expenditures
    007 010 spending cuts
    007 011 cutting domestic programs
    007 012 eliminating government subsidies
    007 013 NOT financing the U.S. budget deficit
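The extraction step described above can be illustrated with a short Python sketch. It is not the authors' program: the function name `gen_query_strings`, the regular expressions, and the comma-splitting rule are all assumptions made for illustration. The listing above shows index 000 on both the nationality string and the first concept phrase; the sketch reproduces that by numbering the two sections separately, which is a guess at the behaviour rather than something the text states.

```python
import re

# Sample topic text, copied from the TREC-1 example in the section above.
TOPIC = """<num> Number 007
<title> Topic: U.S. Budget Deficit
<desc> Description:
Document will mention a proposal to decrease the U.S. budget deficit.
<con> Concept(s):
1. U.S. budget deficit, federal budget shortfall
2. foreign affairs budget, defense budget, entitlements
3. increased revenues, tax increase, tax reform, auction quota
4. reduction in expenditures, spending cuts, cutting domestic programs,
   eliminating government subsidies
5. NOT financing the U.S. budget deficit
<nat> Nationality: U.S.
"""


def _section(topic_text, tag):
    """Return the text after '<tag> Label:' up to the next tag or end of input."""
    pattern = r"<%s>[^:]*:(.*?)(?=<\w+>|\Z)" % tag
    match = re.search(pattern, topic_text, re.S)
    # Collapse line breaks and runs of spaces so wrapped items become one line.
    return " ".join(match.group(1).split()) if match else ""


def _phrases(section_text):
    """Split a section into comma-separated phrases, dropping '1.' style markers."""
    items = re.split(r"(?:^|\s)\d+\.\s+", section_text)
    return [p.strip() for item in items for p in item.split(",") if p.strip()]


def gen_query_strings(topic_text):
    """Hypothetical stand-in for the query generator: one string per phrase."""
    topic_num = re.search(r"<num>\D*(\d+)", topic_text).group(1)
    lines = []
    # Assumption: nationality and concept phrases are each numbered from 000,
    # which matches the two '000' entries in the published listing.
    for tag in ("nat", "con"):
        for index, phrase in enumerate(_phrases(_section(topic_text, tag))):
            lines.append("%s %03d %s" % (topic_num, index, phrase))
    return lines


if __name__ == "__main__":
    for line in gen_query_strings(TOPIC):
        print(line)
```

Run on the sample topic, the sketch yields fifteen strings, one for the nationality and fourteen for the concept phrases. Negated concepts such as "NOT financing the U.S. budget deficit" are passed through unchanged, as in the listing.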