NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Text Retrieval with the TRW Fast Data Finder
M. Mettler
National Institute of Standards and Technology
Donna K. Harman, Editor
There were at least three possible approaches that could be used in preparing queries for
execution by the Fast Data Finder:
(i) Parse the topic narratives, extract key terms and phrases, expand the
terms where possible, and generate queries to find documents with the
same combinations of terms.
(ii) Take a sample of relevant documents, extract common keywords and
phrases, especially those that occur multiple times, and generate queries
to find documents with at least some of the same phrases and keywords
within a sliding window of text about the size of a paragraph.
(iii) Construct the initial queries manually and refine them iteratively.
We elected to try both methods (ii) and (iii). To supply the relevant documents for the
statistical trials, we used the sample relevance judgments supplied by NIST in late May and
early June.
3.1 Automatic Query Generation
Our plan was to take sample documents for a particular topic, merge them together, and
build a PSL query that would find similar documents. Using the single document
WSJ870320-0062 as a seed, the query would be something like:
(30 words -
  5+ (`cola'; `coca'; `coca cola'; `bottling';
      `enterprises'; `cola bottling'; `cola enterprises';
      `coca cola enterprises'; `coca cola bottling'; `atlanta'))
This query finds a document which contains a 30 word sliding window with 5 or more of
the specified terms or phrases. The term list is determined by removing stopwords and
counting the number of occurrences for each term, 2 word phrase, and 3 word phrase in the
seed document. The top 10 terms/phrases with the highest counts are selected. The "30
words" and "5 or more" values were selected arbitrarily and we'd planned to run a series of
trials to determine the optimal values.
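The term-selection procedure described above can be sketched in Python. This is an illustrative reconstruction, not the actual TRW code: the stopword list and the simple word tokenizer are assumptions, and `top_n=10` mirrors the "top 10 terms/phrases" choice in the text.

```python
import re
from collections import Counter

# Illustrative stoplist; the stoplist actually used is not specified in the paper.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for",
             "on", "that", "with", "as", "by", "it", "at"}

def extract_query_terms(text, top_n=10):
    """Remove stopwords, count each remaining term, 2-word phrase, and
    3-word phrase, and return the top_n most frequent terms/phrases."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    counts.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    counts.update(" ".join(words[i:i + 3]) for i in range(len(words) - 2))
    return [term for term, _ in counts.most_common(top_n)]
```

Run against a seed document like WSJ870320-0062, this would yield a term list such as the `coca`/`cola`/`bottling` set shown in the query above.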
The initial experiments with this method of query construction were not encouraging. We
ran into three difficulties.
* The May/June NIST sample relevance judgments seemed incomplete
and inaccurate and were not giving us the statistical base we'd hoped for.
* This method assumes that the whole document is all on one subject.
Longer seed documents were contributing terms that had little to do with
the topic. Some method to segment the documents and indicate the
interesting section is required.
* This method wasn't capturing the subtlety of the topics. The query
shown above does an excellent job of finding documents about Coca
Cola Enterprises or bottling units in Atlanta but completely misses the
part about antitrust violations because it is only mentioned once in the
article.
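The third difficulty follows directly from the query semantics. A minimal sketch of the "30-word window with 5 or more terms" test illustrates it (this assumes simple word tokenization and substring matching, which can over-match across word boundaries; the FDF's actual character-stream pattern matching differs):

```python
import re

def window_match(text, terms, window=30, threshold=5):
    """Return True if some sliding window of `window` words contains
    `threshold` or more distinct terms/phrases from `terms`."""
    words = re.findall(r"[a-z]+", text.lower())
    for start in range(max(1, len(words) - window + 1)):
        chunk = " ".join(words[start:start + window])
        hits = sum(1 for t in terms if t in chunk)
        if hits >= threshold:
            return True
    return False
```

A topic term mentioned only once in the seed article (such as "antitrust") never makes the top-10 list, so no window can match on it, however central it is to the topic.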