Bayesian Inference with Node Aggregation for Information Retrieval
B. Del Favero
R. Fung
fail to represent possible relationships between the topics.
As a consequence, the acquisition of consistent feature-
topic probabilities and the interpretation and comparison of
multiple topic probabilities are problematic. For example,
it would be impossible using this framework to compute the
probability that a document was relevant to two selected
topics or to compute the probability that a document was
relevant to at least one of two selected topics.
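To see the difficulty numerically, consider a minimal sketch with
hypothetical per-topic probabilities (our own illustration, not part of
the system described here): the per-topic marginals alone only pin the
joint probability down to an interval.

    # Hypothetical per-topic posteriors for one document, produced by a
    # framework that treats the topics as unrelated.
    p_t1 = 0.6   # P(document relevant to topic t1)
    p_t2 = 0.5   # P(document relevant to topic t2)

    # Without a joint model, P(relevant to both) is not determined: any
    # value between the Frechet bounds is consistent with the marginals.
    lower = max(0.0, p_t1 + p_t2 - 1.0)   # 0.1
    upper = min(p_t1, p_t2)               # 0.5
    print(f"P(both) could be anywhere in [{lower:.1f}, {upper:.1f}]")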
Bayesian networks can be used to explicitly represent the
relationship between topics. For example, consider Figure
2.3, in which the two topics t1 and t2 are related.
Figure 2.3: Information Retrieval Model with Two
Related Topics
The Bayesian inference problem with multiple topics
becomes more complicated than before. To compute the
posterior probability of each topic, the topic-feature arcs
must be reversed as before and, in addition, all of the
other topics must be marginalized out. Any joint
distribution over the topics can also be computed. To
perform these computations, a general inference algorithm
is needed.
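As a concrete illustration, the following minimal sketch (hypothetical
numbers, not the authors' code) shows how a joint posterior over two
related topics supports both marginalization and multi-topic queries:

    # A joint posterior over two related topics, P(t1, t2 | document),
    # keyed by present/absent combinations. The numbers are hypothetical.
    joint = {
        (True,  True):  0.30,
        (True,  False): 0.25,
        (False, True):  0.10,
        (False, False): 0.35,
    }

    # Marginalizing out t2 yields the posterior for t1 alone ...
    p_t1 = sum(p for (t1, _), p in joint.items() if t1)   # 0.55

    # ... and the joint entries answer the multi-topic queries that an
    # unrelated-topic model cannot express.
    p_both = joint[(True, True)]                          # 0.30
    p_either = 1.0 - joint[(False, False)]                # 0.65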
However, there is a way to simplify a multiple-topic
network so that it looks like a single-topic network (Figure
2.1). The topics can be combined into a compound topic
(node S in Figure 2.4) (Chang & Fung, 1989) whose range
is all possible present-or-absent combinations of t1 and t2.
An advantage of this representation is that the same simple
computational formulas can be used as before.

Figure 2.4: Multiple-topic query represented as a single
compound-topic query
A disadvantage of this representation is that the
intermediate query, S, must contain (in the worst case) 2^n
states, representing all possible present-or-absent
combinations of its n parent topics. However, this worst
case is rarely seen, because the relationships between topics
are typically sparse. Building this intermediate query is one
of the innovations of our research and is described in
Section 3.2.
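A minimal sketch of the compound-topic construction (the pruning
predicate is our own illustrative assumption; Section 3.2 describes the
actual method):

    from itertools import product

    def compound_states(topics, consistent=lambda combo: True):
        # States of the compound node S are the present/absent
        # combinations of its parent topics: 2**n in the worst case.
        # The `consistent` predicate stands in for the sparse topic
        # relationships that prune most of those combinations.
        states = []
        for combo in product((True, False), repeat=len(topics)):
            if consistent(dict(zip(topics, combo))):
                states.append(combo)
        return states

    # Unrelated topics: every combination survives, 2**3 = 8 states.
    print(len(compound_states(["t1", "t2", "t3"])))        # 8

    # A hypothetical constraint ("t2 occurs only with t1") prunes states.
    rule = lambda c: c["t1"] or not c["t2"]
    print(len(compound_states(["t1", "t2", "t3"], rule)))  # 6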
3 Retrieval Architecture

This section describes the information retrieval architecture
we have developed.

The inputs to the system are the descriptions of topics of
interest and a set of documents to test. The output of the
system is a list of documents ranked by their degree of
relevance to each topic. The system computes the degree
of relevance as the probability that a document is relevant
to a topic, for every document-topic pair. To help in this
task, the system is given a training set of documents to
which relevance judgements have been attached.
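Schematically, the system's input/output contract looks like the
following sketch (the function names are our own, not the system's):

    def rank_documents(documents, topics, relevance_probability):
        # For each topic, rank all documents by the probability that
        # they are relevant to it, highest probability first.
        rankings = {}
        for topic in topics:
            scored = [(relevance_probability(doc, topic), doc)
                      for doc in documents]
            scored.sort(key=lambda pair: pair[0], reverse=True)
            rankings[topic] = scored
        return rankings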
There are several components to our retrieval system. The
feature extraction component, described in Section 3.1,
translates a document from its raw text form to a simpler
internal representation that the system can use. The query
construction component, described in Section 3.2, translates
the description of a set of topics into an internal
representation. The document scoring component,
described in Section 3.3, uses Bayesian inference to
calculate a measure of a document's relevance to a topic,
given the internal representations of both document and
topic.
3.1 Feature Selection and Extraction
Any observable characteristic of the text of a document that
may be clearly defined (e.g., as either present or absent)
may be regarded as a feature. Our system looks for two
types of features in the text of a document: single words
and pairs of adjacent single words. If a feature appears at
least once in a document, it is counted as "present."
Otherwise, it is counted as "absent."
The internal representation of a document is therefore a
binary vector, each element indicating the presence or
absence of the corresponding feature in the document.
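The following sketch mirrors that description (the tokenizer and the
toy vocabulary are our own assumptions, not the authors' code):

    import re

    def extract_features(text, vocabulary):
        # Features are single words and pairs of adjacent words; each
        # maps to 1 if it appears at least once in the document, else 0.
        words = re.findall(r"[a-z]+", text.lower())
        present = set(words)                                  # single words
        present |= {" ".join(p) for p in zip(words, words[1:])}  # word pairs
        return [1 if f in present else 0 for f in vocabulary]

    vocab = ["retrieval", "bayesian inference", "aggregation"]
    print(extract_features("Bayesian inference for text retrieval.", vocab))
    # -> [1, 1, 0]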
The system removes many common suffixes from a word
using Paice's stemming rules (Paice, 1977). This means,
for example, that if either of the words "walking" or