SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
N-Gram-Based Text Filtering For TREC-2
chapter
W. Cavnar
National Institute of Standards and Technology
D. K. Harman
[Figure 1: Dataflow Diagram for N-Gram-Based Text Retrieval. Labels recoverable from the diagram: Topics, Generate Queries, Compressed Documents, Uncompressed Documents, Relevance Scores, Find Top Choices, Top Choice Scores.]
2.1 Generating Query Strings From TREC Topics
The process labelled "Generate Queries" represents a program,
gen_query, which takes the TREC topics file and generates a set
of query strings for each topic. We considered several different
schemes for extracting query strings from the topics, but finally
settled on using just the phrases in the concept and nationality
sections. The following is a typical topic from TREC-1:
<num> Number 007
<title> Topic: U.S. Budget Deficit
<desc> Description: Document will mention a
proposal to decrease the U.S. budget deficit.
...<con> Concept(s):
1. U.S. budget deficit, federal budget shortfall
2. foreign affairs budget, defense budget, entitlements
3. increased revenues, tax increase, tax reform, auction quota
4. reduction in expenditures, spending cuts, cutting domestic
programs, eliminating government subsidies
5. NOT financing the U.S. budget deficit
<nat> Nationality: U.S.
Given this topic, the gen_query program generates the following
set of query strings:
007 000 U.S.
007 000 U.S. budget deficit
007 001 federal budget shortfall
007 002 foreign affairs budget
007 003 defense budget
007 004 entitlements
007 005 increased revenues
007 006 tax increase
007 007 tax reform
007 008 auction quota
007 009 reduction in expenditures
007 010 spending cuts
007 011 cutting domestic programs
007 012 eliminating government subsidies
007 013 NOT financing the U.S. budget deficit
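
The extraction step illustrated above can be sketched in a few lines of Python. This is only a minimal illustration, not the original program: the function names (section, phrases, gen_query_sketch) are hypothetical, and the splitting rule is inferred from the listing rather than stated in the text. In particular, the sketch assumes that phrases are separated by commas, that the nationality phrases are emitted before the concept phrases, and that each of the two sections is numbered independently from 000, which is one way to account for index 000 appearing twice in the listing.

```python
import re

# Topic text as in the example above, with typesetting hyphenation removed.
TOPIC = """<num> Number 007
<title> Topic: U.S. Budget Deficit
<desc> Description: Document will mention a
proposal to decrease the U.S. budget deficit.
<con> Concept(s):
1. U.S. budget deficit, federal budget shortfall
2. foreign affairs budget, defense budget, entitlements
3. increased revenues, tax increase, tax reform, auction quota
4. reduction in expenditures, spending cuts,
cutting domestic
programs, eliminating government subsidies
5. NOT financing the U.S. budget deficit
<nat> Nationality: U.S.
"""


def section(topic, tag):
    """Return the text between <tag> and the next <...> tag (or end of input)."""
    m = re.search(r"<" + tag + r">(.*?)(?=<\w+>|\Z)", topic, re.DOTALL)
    return m.group(1).strip() if m else ""


def phrases(body):
    """Drop the 'Label:' prefix and the item numbers, then split on commas."""
    body = body.split(":", 1)[-1]
    # Treat an item number such as "1. " at the start of a line as a separator.
    body = re.sub(r"(?m)^\s*\d+\.\s+", ", ", body)
    # Collapse line wraps and extra spaces inside each phrase.
    parts = (" ".join(p.split()) for p in body.split(","))
    return [p for p in parts if p]


def gen_query_sketch(topic):
    """Emit 'topic-number index phrase' lines (assumed layout, see lead-in)."""
    num = re.search(r"\d+", section(topic, "num")).group(0)
    lines = []
    for tag in ("nat", "con"):  # assumption: nationality first, then concepts
        for i, phrase in enumerate(phrases(section(topic, tag))):
            lines.append("%s %03d %s" % (num, i, phrase))
    return lines


if __name__ == "__main__":
    print("\n".join(gen_query_sketch(TOPIC)))
```

Under these assumptions the sketch yields one string for the nationality section and fourteen for the concept section. Note that a phrase such as "NOT financing the U.S. budget deficit" is passed through unchanged; the sketch makes no attempt to interpret the negation.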