NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), D. K. Harman, ed., National Institute of Standards and Technology

Bayesian Inference with Node Aggregation for Information Retrieval
B. Del Favero and R. Fung
"walks" is present in a document, the system considers the
root word "walk" to be present in the document.
The system must be given a list of features for which to
look. This target list was constructed in three steps. First, a
list of candidate features was generated from the
descriptions of the 50 TREC-2 routing topics. The text of
the descriptions was broken up into individual words, from
which the suffixes (if any) were removed. Duplicate words
and common words (stop words) were removed. A number
of two-word features (such as phrases and proper names)
were identified by hand. This procedure created a list of
about 1400 candidate features.
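The sketch below illustrates this candidate-generation step. It is not the system's actual code: the stop-word list, the suffix stripping, and the hand-picked two-word features shown here are placeholders for the lists actually used.

import re

# Illustrative stop-word list and suffix set; the lists actually used by the
# system are not given in the text.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "on", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")   # crude suffix removal, not a full stemmer

def strip_suffix(word):
    # Reduce a word to its root by dropping the first matching suffix,
    # so that, e.g., "walks" is reduced to "walk".
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def candidate_features(topic_descriptions, hand_picked_phrases=()):
    # Build the candidate feature list from the routing-topic descriptions:
    # split into words, strip suffixes, drop stop words and duplicates, and
    # add the two-word features identified by hand.
    candidates = set()
    for text in topic_descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                candidates.add(strip_suffix(word))
    candidates.update(hand_picked_phrases)
    return sorted(candidates)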
Second, the system extracted the relative frequency
information for each of these features from the training
documents. Then, for each topic (the 10 plus an additional
topic representing "none of the above,") the system sorted
the candidate features in descending order according to the
F4 formula of Robertson and Sparck Jones (Robertson &
Sparck Jones, 1976), which we used as a measure of the
ability of a feature to characterize a topic.
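As a concrete, hedged illustration of this ranking step, the function below computes an F4-style weight from training counts and sorts the candidates by it. The 0.5 smoothing terms and the variable names are our additions; consult Robertson & Sparck Jones (1976) for the exact form used in the system.

import math

def f4_weight(r, n, R, N):
    # r : training documents containing the feature and relevant to the topic
    # n : training documents containing the feature
    # R : training documents relevant to the topic
    # N : training documents in total
    # The 0.5 terms keep the weight defined when a contingency cell is zero.
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def rank_features(counts, R, N):
    # counts maps feature -> (r, n) gathered from the training documents;
    # candidates are returned in descending order of F4 weight.
    return sorted(counts,
                  key=lambda f: f4_weight(counts[f][0], counts[f][1], R, N),
                  reverse=True)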
Third, the top 30 features for each topic (and the top 60
features for "none of the above") were combined into a
single list. After removing duplicates, this yielded the final
list of 229 features.
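A minimal sketch of this merging step, assuming the per-topic rankings produced above (the function and argument names are ours):

def build_target_list(ranked_by_topic, ranked_none_of_the_above, k=30):
    # ranked_by_topic: dict mapping each of the 10 topics to its candidate
    # features sorted in descending F4 order; the extra "none of the above"
    # topic contributes twice as many features.  Duplicates disappear because
    # a set is used.
    target = set()
    for ranked in ranked_by_topic.values():
        target.update(ranked[:k])
    target.update(ranked_none_of_the_above[:2 * k])
    return sorted(target)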
We tried numerous feature selection strategies and settled
on this one as the most satisfactory.
3.2 Query Representation and Construction
In preparation for the inference step, the 10 single-topic
queries are combined into one multiple-topic query. As
mentioned in Section 2.3, this aggregation can in the worst
case require 2^10 = 1024 different states in the multiple-topic query.
In this section, we describe how to reduce the number of
states required by considering the relationships between the
topics.
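To make the reduction concrete, the sketch below enumerates the joint relevance patterns over the topics and discards those that contradict known pairwise relations. This is only an illustration of the idea, not the paper's exact construction, and the constraint encoding is our assumption.

from itertools import product

def joint_states(topics, mutually_exclusive=(), subset_of=()):
    # Each state is a tuple of 0/1 relevance indicators, one per topic.
    # Without constraints there are 2**len(topics) states; pairwise relations
    # between topics rule many of them out.
    index = {t: i for i, t in enumerate(topics)}
    states = []
    for pattern in product((0, 1), repeat=len(topics)):
        if any(pattern[index[a]] and pattern[index[b]]
               for a, b in mutually_exclusive):
            continue   # violates mutual exclusivity (Relation 1)
        if any(pattern[index[a]] and not pattern[index[b]]
               for a, b in subset_of):
            continue   # violates a subset relation (Relations 2 and 3)
        states.append(pattern)
    return states

For example, if all 10 topics were pairwise mutually exclusive, only 11 states would survive: one per topic plus the all-zero "none of the above" state.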
Once the states of the query have been structured, we must
assign numerical values to the Bayesian model. We
describe how to generate estimates of the prior probabilities
of each of these states in Section 3.2.3. We describe in
Section 3.2.4 how to define, for each feature and state, the
conditional probability that the feature appears in a
document given that the document is relevant to that state.
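As a simplified illustration of how these two kinds of numbers combine during inference, the sketch below scores each state of the multiple-topic query against a document under the assumption that features are conditionally independent given the state; the actual belief-network computation in the paper need not take exactly this form.

def posterior_over_states(priors, conditionals, document_features):
    # priors:            dict state -> P(state)                 (Section 3.2.3)
    # conditionals:      dict (feature, state) -> P(feature appears | state)
    #                                                           (Section 3.2.4)
    # document_features: dict feature -> True/False, whether the feature was
    #                    observed in the document
    scores = {}
    for state, prior in priors.items():
        score = prior
        for feature, present in document_features.items():
            p = conditionals[(feature, state)]
            score *= p if present else (1.0 - p)
        scores[state] = score
    total = sum(scores.values())
    return {state: score / total for state, score in scores.items()}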
3.2.1 Pairwise Relationships Between Topics
There are six possible pairwise relationships between any
two topics t1 and t2:
1. t1 and t2 are mutually exclusive if no document is
relevant to both topics
2. t1 is a subset of t2 if all documents relevant to t1 are
also relevant to t2
3. t2 is a subset of t1 if all documents relevant to t2 are
also relevant to t1
4. t1 and t2 are equivalent if t1 and t2 are subsets of
each other (they satisfy Relations 2 and 3)
5. t1 and t2 are dependent if knowing that a document
is relevant to t1 gives you some information about
whether the document is relevant to t2
6. t1 and t2 are independent if knowing that a
document is relevant to t1 gives you no information
on whether the document is relevant to t2
Each of the Relations 1-5 is a type of dependence between
topics. In a belief network, topics satisfying Relations 1-5
would be connected by an arc. To ensure that two topics
satisfy only one of the relations, Relation 5 (dependence) is
defined as any type of dependence that is different from
those of Relations 1-4.
Relation 6, independence, is represented in a belief network
by the absence of an arc between the topics.
The distinction between dependence (Relation 5) and
independence (Relation 6) is useful in calculating the
probabilities of combinations of the topics, as described in
Section 3.2.3.
Relations 1-4 can be identified by testing whether the
defining conditions are satisfied in the training set. If the
topics satisfy none of the first four relations, then a chi-
square test can distinguish between Relations 5 and 6. If
there are too few documents with relevance judgements for
both topics to make reliable conclusions (our cutoff was 13
documents), the system makes a tentative assessment and
then prompts the user to verify it manually.
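A hedged sketch of this classification procedure is given below. The contingency-table layout, the significance level, and the return convention are our assumptions; only the 13-document cutoff comes from the text.

from scipy.stats import chi2_contingency

MIN_JUDGED = 13   # cutoff from the text; below this the user must verify
ALPHA = 0.05      # significance level for the chi-square test (assumed)

def pairwise_relation(judged):
    # judged: list of (rel_t1, rel_t2) boolean pairs, one per document that
    # has relevance judgements for both topics.  Returns (relation,
    # needs_verification): with fewer than MIN_JUDGED documents the assessment
    # is still made but flagged for manual verification.
    both    = sum(1 for a, b in judged if a and b)
    only1   = sum(1 for a, b in judged if a and not b)
    only2   = sum(1 for a, b in judged if b and not a)
    neither = len(judged) - both - only1 - only2

    if both == 0 and (only1 or only2):
        relation = "mutually exclusive"          # Relation 1
    elif only1 == 0 and only2 == 0:
        relation = "equivalent"                  # Relation 4
    elif only1 == 0:
        relation = "t1 subset of t2"             # Relation 2
    elif only2 == 0:
        relation = "t2 subset of t1"             # Relation 3
    else:
        _, p_value, _, _ = chi2_contingency([[both, only1], [only2, neither]])
        relation = "dependent" if p_value < ALPHA else "independent"  # 5 vs 6
    return relation, len(judged) < MIN_JUDGED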
To fully explore the pairwise relationships between topics,
we selected ten of the fifty TREC-2 routing topics to use in
our models and tests. The topics are listed in Table 3.1.