NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-II Routing Experiments with the TRW/Paracel Fast Data Finder
M. Mettler
National Institute of Standards and Technology
D. K. Harman
4.0 Statistical Query Generation
Our second set of experiments revolved around the use of term weighting. Our basic
approach was to follow the well-researched path of generating term weights proportional to
the occurrence of words in a sample of relevant text and inversely proportional to the
occurrence of words in the database as a whole. Since we were doing the routing topics,
we gathered statistics on Volume II and hoped that they would be valid for Volume III. For
sample relevant documents we used the NIST TREC-I relevant documents from Volume
II. We did not use the topic narratives or descriptions. In addition to single word terms,
we also considered two and three word phrases. We used the FDF itself to scan the training
corpus and determine the phrase frequencies of interest.
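The weighting scheme described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the term names and counts below are hypothetical values invented for the example.

```python
# Sketch of the weighting approach described above: a term's weight is
# proportional to its document frequency in the relevant sample and
# inversely proportional to its document frequency in the database.

def term_weight(doc_count, db_count):
    """doc_count: relevant-sample documents containing the term;
       db_count: training-database documents containing the term."""
    return doc_count / db_count

# Hypothetical counts, for illustration only.
terms = {"tornado": (12, 340), "fatalities": (9, 2100), "the": (15, 99000)}
weights = {t: term_weight(dc, dbc) for t, (dc, dbc) in terms.items()}
```

A common word like "the" appears in nearly every database document, so its weight collapses toward zero, while a word concentrated in the relevant sample rises to the top.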
To adapt this standard approach to the FDF, we needed to make three algorithmic
modifications. First, we needed to accommodate the limitations imposed by the FDF hardware.
While extremely effective for pattern matching, the FDF is not a general purpose computer.
While the FDF processor cells can perform basic addition/subtraction, the datapath
available to accumulate an aggregate score for a document is limited to 8 or 9 bits. Thus
we had to restrict the term weights (and the range of their sums) to integer values between
0 and 255 or 0 and 511. This had the effect of truncating most topics' query lists at 10-
20 terms (words, phrases, or special features). We also excluded terms from our queries
that did not appear in at least 30% of the relevant sample documents.
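The two constraints above (integer weights in a small range, and a minimum-support cutoff) can be combined into one query-building step. The sketch below is an assumed reconstruction, not the original code; the scaling strategy (normalizing to the largest raw weight) and the example counts are our own choices for illustration.

```python
# Sketch: quantize raw weights to integers in 0..max_weight and drop terms
# appearing in fewer than min_support of the relevant sample documents.

def build_query(term_stats, sample_size, max_weight=255, min_support=0.30):
    # term_stats: {term: (doc_count_in_sample, doc_count_in_database)}
    kept = {t: dc / dbc for t, (dc, dbc) in term_stats.items()
            if dc / sample_size >= min_support}
    if not kept:
        return {}
    top = max(kept.values())
    return {t: round(w / top * max_weight) for t, w in kept.items()}
```

With an 8-bit accumulator, the sum of matched weights must also stay within range, which is what forces the short 10-20 term query lists.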
Second, we strove not to give up the strength of the FDF's pattern matching
capabilities: pinpointing special features in the text that have a large impact on document
relevance. We manually reviewed the topics and prepared special feature subqueries in an
attempt to increase the precision for particular topics. For Topic 59, Weather Related
Fatalities, we manually prepared a special feature subquery to detect phrases detailing a
numeric value of people killed. We determined the frequency of each special feature, both
in the sample relevant documents and in the training database as a whole, and just added
these into the word list as if they were regular single word terms. In some instances our
manually prepared subqueries jumped to the top of the list of statistically relevant terms for
a topic; in others they didn't.
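A special feature subquery of the kind described for Topic 59 might look like the following. This is a hypothetical Python regex standing in for what was actually an FDF hardware pattern expression; the exact phrasing it matches is our assumption.

```python
import re

# Hypothetical stand-in for a Topic 59 special-feature subquery: detect
# phrases giving a numeric (or spelled-out) count of people killed.
KILLED = re.compile(
    r"\b(\d+|[a-z]+)\s+(?:people|persons)\s+(?:were\s+)?killed\b",
    re.IGNORECASE,
)

def has_fatality_phrase(text):
    return KILLED.search(text) is not None
```

Once such a feature's frequency is counted in the sample and in the database, it can be weighted exactly like a single word term, as the text describes.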
Third, we observed that some topics had particular words, phrases, or special features that
were present in almost all relevant documents. We converted terms that occurred in >90%
of the relevant documents to boolean ANDs in our queries. This was intended to improve
precision for topics like 62, Military Coups D'Etat. The topic narrative specifically stated
that the country involved must be named. One of our special feature subqueries was a list
of known foreign country names. While of no statistical significance as a term, this
subquery did hit on almost every sample document. We thus ANDed it into the query as a
required boolean term.
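The resulting query structure, a conjunction of required terms gating a capped weighted sum, can be sketched as follows. This is an assumed scoring model, not the FDF query language itself; the term names and weights are hypothetical.

```python
# Sketch: a document scores only if every required (>90%-support) term is
# present; the score is then the capped integer sum of the weights of the
# optional terms it contains (cap reflects the 8-bit accumulator).

def score(doc_terms, required, weighted, cap=255):
    if not required.issubset(doc_terms):
        return 0
    return min(cap, sum(w for t, w in weighted.items() if t in doc_terms))
```

A document missing the required country-name feature scores zero regardless of how many other query terms it contains, which is exactly the precision effect the boolean AND is meant to achieve.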
Table II shows a sample statistical query. DocCount is the number of documents in the
sample that included the term, phrase, or special feature. DbCount is the number of
documents in the training database as a whole that included the term. Weight is DocCount
divided by DbCount. PslWeight is the integer coefficient based on the Weight. The relevant
documents retrieved by the statistically generated queries are labeled TRW2 in Table III.