NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1), edited by Donna K. Harman, National Institute of Standards and Technology

Text Retrieval with the TRW Fast Data Finder
M. Mettler

at least three possible approaches that could be used in preparing queries for execution by the Fast Data Finder.

* Parse the topic narratives, extract key terms and phrases, expand the terms where possible, and generate queries to find documents with the same combinations of terms.

* Take a sample of relevant documents, extract common keywords and phrases, especially those that occur multiple times, and generate queries to find documents with at least some of the same phrases and keywords within a sliding window of text about the size of a paragraph.

* Construct the initial queries manually and refine them iteratively.

We elected to try both methods (ii) and (iii). To supply the relevant documents for the statistical trials, we used the sample relevance judgments supplied by NIST in late May and early June.

3.1 Automatic Query Generation

Our plan was to take sample documents for a particular topic, merge them together, and build a PSL query that would find similar documents. Using the single document WSJ870320-0062 as a seed, the query would be something like:

(30 words - 5+ ('cola'; 'coca'; 'coca cola'; 'bottling'; 'enterprises'; 'cola bottling'; 'cola enterprises'; 'coca cola enterprises'; 'coca cola bottling'; 'atlanta'))

This query finds a document which contains a 30-word sliding window with 5 or more of the specified terms or phrases. The term list is determined by removing stopwords and counting the number of occurrences of each term, 2-word phrase, and 3-word phrase in the seed document. The 10 terms/phrases with the highest counts are selected. The "30 words" and "5 or more" values were selected arbitrarily, and we planned to run a series of trials to determine the optimal values.
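The term-selection step described above can be sketched as follows. This is a minimal illustration, not the actual query generator: the stopword list, the tokenizer, the choice to form phrases over the stopword-filtered word stream, and the rendered query syntax (PSL's real operator for the window constraint may differ) are all assumptions.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption; the real list was presumably larger).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for",
             "on", "that", "with", "by", "it", "as", "at", "or"}

def top_terms(text, n=10):
    """Return the n most frequent terms, 2-word phrases, and 3-word
    phrases from a seed document, after stopword removal."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    counts = Counter()
    for size in (1, 2, 3):
        for i in range(len(words) - size + 1):
            counts[" ".join(words[i:i + size])] += 1
    return [term for term, _ in counts.most_common(n)]

def build_query(text, window=30, threshold=5, n=10):
    """Render a PSL-style query string: a `window`-word sliding window
    containing `threshold` or more of the top terms/phrases.
    The exact operator syntax is an approximation."""
    terms = "; ".join(f"'{t}'" for t in top_terms(text, n))
    return f"({window} words - {threshold}+ ({terms}))"
```

For a short seed text about Coca Cola bottling, `build_query` produces a query of the same shape as the example above, with the window size and hit threshold exposed as the parameters the trials were meant to tune.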
The initial experiments with this method of query construction were not encouraging. We ran into three difficulties.

* The May/June NIST sample relevance judgments seemed incomplete and inaccurate and were not giving us the statistical base we had hoped for.

* This method assumes that the whole document is all on one subject. Longer seed documents were contributing terms that had little to do with the topic. Some method to segment the documents and indicate the interesting section is required.

* This method wasn't capturing the subtlety of the topics. The query shown above does an excellent job of finding documents about Coca Cola Enterprises or bottling units in Atlanta but completely misses the part about antitrust violations, because that is only mentioned once in the article.

We
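The window semantics that these queries rely on can be approximated as below. This is a sketch under stated assumptions: the paper does not say whether a window scores distinct terms or total occurrences, so this version counts each term/phrase at most once per window. It also makes the third difficulty concrete: a term like "antitrust" that appears only once in the seed never reaches the top-10 list, so documents matching only on it are never found.

```python
import re

def window_match(doc, terms, window=30, threshold=5):
    """Return True if any `window`-word sliding window of `doc` contains
    `threshold` or more of the given terms (terms may be multi-word
    phrases). Each term counts at most once per window (assumption)."""
    words = re.findall(r"[a-z]+", doc.lower())
    term_words = [t.split() for t in terms]
    for start in range(max(1, len(words) - window + 1)):
        chunk = words[start:start + window]
        hits = sum(
            1 for tw in term_words
            if any(chunk[i:i + len(tw)] == tw for i in range(len(chunk)))
        )
        if hits >= threshold:
            return True
    return False
```

With the Coca Cola term list, a sentence about bottling units in Atlanta matches easily, while a passage that mentions only antitrust violations scores zero hits, matching the failure mode described above.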