NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Bayesian Inference with Node Aggregation for Information Retrieval
B. Del Favero and R. Fung

Edited by D. K. Harman, National Institute of Standards and Technology
Third, the priors are widely divergent from the intuition of
any user who reads a newspaper. The numbers suggest, for
instance, that about 1 out of every 100 articles in the WSJ
(and, by analogy, in the SJMN) is relevant to topic 57,
"MCI." Any reader of the WSJ or the SJMN knows that
this estimate is much too high; the relative frequency is
probably closer to 1 in 500 or 1 in 1000 than to 1 in 100.
Thus, the prior probabilities were assessed manually. It is
assumed that the user has some knowledge of the test
domain (in this case, articles in the SJMN) and can, with
some thought, assess the relative frequencies of the various
states as part of specifying the query for the particular
routing request. For each state, we ask the user for the
average number of weeks between the publication of
articles relevant to that state. This number is presented in
the third column of Table 3.4. It can be converted to a prior
probability by combining it with the assumption that there
are 1000 documents per week.
The prior probability of U, p(U), is calculated as one
minus the sum of the priors of all the other states, to ensure
that the probabilities of all the states sum to one.
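As a concrete sketch of this conversion (the week counts below are illustrative, not the assessments of Table 3.4): if articles relevant to a state appear once every 2 weeks on average, then at 1000 documents per week roughly 1 of every 2000 documents falls in that state, giving a prior of 0.0005. A minimal Python sketch:

    DOCS_PER_WEEK = 1000  # assumption stated above

    def prior_from_weeks(weeks_between_articles):
        # One relevant article per (weeks * DOCS_PER_WEEK) documents.
        return 1.0 / (weeks_between_articles * DOCS_PER_WEEK)

    # Hypothetical assessed week counts, not those of Table 3.4.
    weeks = {"(57 97 98)": 20.0, "(57 -97 98)": 4.0, "(57 74)": 2.0}
    priors = {state: prior_from_weeks(w) for state, w in weeks.items()}
    priors["U"] = 1.0 - sum(priors.values())  # remaining mass goes to U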
3.2.4 Feature Conditional Probabilities
The inference algorithm requires, for each feature and for
each state, the conditional probability of the feature given
the state. These probabilities cannot be estimated directly
as relative frequencies in the training set, because there are
few documents that are relevant to more than one topic at
a time.
We approximate these probabilities by using a structure
called a noisy-or gate.
The noisy-or gate combines the effects of two or more
factors, each of which may contribute to the presence of a
feature. It is a model of disjunctive interaction (Pearl,
1988). It has been used in medical decision-making
research to calculate the probability that a particular
symptom is present, given the diseases that cause the
symptom (Heckerman, 1989).
In the context of information retrieval, a feature may be
present due to any of the topics that are relevant in the state.
For each state-feature pair, we build a noisy-or model. The
contributing factors are the topics that are relevant within
the state. The effect is the feature's presence or absence in
the document.
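In the standard formulation of disjunctive interaction, if q_i denotes the probability that topic i alone causes the feature to be present, then the probability that the feature is absent given the state is the product of the terms (1 - q_i) over the topics i that are relevant in the state, and the probability that the feature is present is one minus that product.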
For example, consider a feature f and the state (57 -97 98).
The feature may be present due to topic 57 or topic 98. It
cannot be present due to topic 97, because that topic is not
relevant within the state. Let E1 be the event that the
feature is present due to topic 57, and let E2 be the event
that the feature is present due to topic 98. Table 3.5 lists all
the possible cases of the two uncertain events. Figure 3.4
shows the belief network structure of the noisy-or model.
The node with the double wall is a deterministic logical-or
gate.
Feature present   Feature present   Feature present
due to 57 (E1)    due to 98 (E2)    at all (E1 OR E2)
---------------   ---------------   -----------------
Yes               Yes               Yes
Yes               No                Yes
No                Yes               Yes
No                No                No

Table 3.5: Possible cases for a noisy-or node
[Figure 3.4: Belief network corresponding to a noisy-or
gate model]
The only case in Table 3.5 in which the feature is absent is
the fourth case. Thus, the conditional probability that the
feature is absent, given this state, is the probability of that
fourth case: since E1 and E2 are assumed to occur
independently, it is the product p(not E1) · p(not E2). The
probability that the feature is present is one minus the
probability that the feature is absent.
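A minimal Python sketch of this computation, with hypothetical per-topic probabilities (the probability that each topic alone produces the feature; the values are illustrative only):

    def noisy_or(activation_probs):
        # Probability that the feature is present, given the state.
        # activation_probs holds, for each topic relevant in the state,
        # the probability that that topic alone produces the feature.
        p_absent = 1.0
        for q in activation_probs:
            p_absent *= 1.0 - q  # every relevant topic fails to produce it
        return 1.0 - p_absent

    # State (57 -97 98): only topics 57 and 98 can produce the feature.
    q_57, q_98 = 0.30, 0.10  # hypothetical values
    p_present = noisy_or([q_57, q_98])  # 1 - (0.70)(0.90) = 0.37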
3.3 Document Scoring
The Bayesian inversion described in Section 2.1 yields, for
each document, the posterior probability that the document
is relevant to each state. We calculate the posterior
probability that the document is relevant to each topic by
summing the posterior probabilities of all of the states in
which the topic appears.
For example, the posterior probability for topic 57 is the
sum of the posterior probabilities of the five states in which
topic 57 is relevant (refer to Table 3.4). The states are
(57 97 98), (57 97 -98), (57 -97 98), (57 -97 -98 -74), and
(57 74).
The final list of documents for each topic contains the top
1000 documents, ranked in descending order according to
the posterior probability that they are relevant to that topic.
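A minimal Python sketch of this scoring and ranking step, assuming each state is represented by the set of topics relevant in it and that the per-document state posteriors have already been computed (all names are illustrative):

    def topic_posterior(state_posteriors, topic):
        # Sum the posteriors of all states in which the topic is relevant.
        # state_posteriors maps a frozenset of relevant topics to a probability.
        return sum(p for state, p in state_posteriors.items() if topic in state)

    def rank_documents(doc_posteriors, topic, n=1000):
        # Rank documents by descending posterior probability of relevance
        # to the topic and keep the top n.
        scores = {doc: topic_posterior(sp, topic)
                  for doc, sp in doc_posteriors.items()}
        return sorted(scores, key=scores.get, reverse=True)[:n]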