fail to represent possible relationships between the topics. As a consequence, the acquisition of consistent feature-topic probabilities and the interpretation and comparison of multiple topic probabilities are problematic. For example, it would be impossible using this framework to compute the probability that a document was relevant to two selected topics, or the probability that it was relevant to at least one of two selected topics. Bayesian networks can be used to explicitly represent the relationship between topics. For example, consider Figure 2.3, in which the two topics t1 and t2 are related.

Figure 2.3: Information Retrieval Model with Two Related Topics

The Bayesian inference problem with multiple topics becomes more complicated than before. To compute the posterior probability of each topic, all the other topics must be removed through marginalization, in addition to reversing the topic-feature arcs as before. Any joint distribution over the topics can also be computed. To perform these computations, a general inference algorithm is needed. However, there is a way to simplify a multiple-topic network to look like a single-topic network (Figure 2.1). The topics can be combined into a compound topic (node S in Figure 2.4) (Chang & Fung, 1989) whose range is all possible present-or-absent combinations of t1 and t2.

Figure 2.4: Multiple-topic query represented as a single compound-topic query

An advantage of this representation is that the same simple computational formulas can be used as before. A disadvantage is that the intermediate query, S, must contain (in the worst case) 2^n states, representing all possible present-or-absent combinations of its n parent topics. However, this worst case is rarely seen, because the relationships between topics are typically sparse. Building this intermediate query is one of the innovations of our research and is described in Section 3.2.
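To make the compound-topic computation concrete, the following Python sketch is illustrative rather than the system's actual implementation. It assumes the features are binary observations that are conditionally independent given the compound node S, applies the same single-node Bayes-rule update that a single-topic network would use, and then marginalizes back to per-topic probabilities. The names posterior_topic_probs, prior, and likelihood are hypothetical.

```python
def posterior_topic_probs(prior, likelihood, features):
    """Score one document against a compound-topic node.

    prior:      dict mapping each compound state (a tuple of 0/1 flags,
                one per topic) to its prior probability P(S = s).
    likelihood: dict mapping (feature, state) to P(feature present | S = s).
    features:   dict mapping feature to 1 (present) or 0 (absent) in the
                document being scored.

    Returns the posterior probability that each topic is relevant,
    obtained by applying Bayes' rule once to the compound node and
    then marginalizing.
    """
    # Bayes' rule over compound states, assuming features are
    # conditionally independent given the compound state.
    unnormalized = {}
    for state, p in prior.items():
        for feature, observed in features.items():
            p_present = likelihood[(feature, state)]
            p *= p_present if observed else 1.0 - p_present
        unnormalized[state] = p
    total = sum(unnormalized.values())
    posterior = {s: p / total for s, p in unnormalized.items()}

    # Marginalize the compound posterior back to individual topics:
    # P(t_i is relevant) is the mass of all states whose i-th flag is 1.
    n_topics = len(next(iter(posterior)))
    return [sum(p for s, p in posterior.items() if s[i] == 1)
            for i in range(n_topics)]


# Example: two related topics t1 and t2; S has the four states (t1, t2).
# The prior encodes their relationship (here, a mild negative correlation).
prior = {(0, 0): 0.50, (0, 1): 0.20, (1, 0): 0.20, (1, 1): 0.10}
likelihood = {("bank", s): (0.8 if s[0] else 0.1) for s in prior}
likelihood.update({("loan", s): (0.7 if s[1] else 0.2) for s in prior})
print(posterior_topic_probs(prior, likelihood, {"bank": 1, "loan": 0}))
```

Because the prior is stored as a dictionary, only the compound states actually given probability mass need to be enumerated, which reflects the point above that the full 2^n state space is rarely required when topic relationships are sparse.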
3 Retrieval Architecture

This section describes the information retrieval architecture we have developed. The inputs to the system are the descriptions of topics of interest and a set of documents to test. The output of the system is a list of documents ranked by their degree of relevance to each topic. The system computes the degree of relevance as the probability that a document is relevant to a topic, for every document-topic pair. To help in this task, the system is given a training set of documents to which relevance judgements have been attached.

There are several components to our retrieval system. The feature extraction component, described in Section 3.1, translates a document from its raw text form to a simpler internal representation that the system can use. The query construction component, described in Section 3.2, translates the description of a set of topics into an internal representation. The document scoring component, described in Section 3.3, uses Bayesian inference to calculate a measure of a document's relevance to a topic, given the internal representations of both document and topic.

3.1 Feature Selection and Extraction

Any observable characteristic of the text of a document that may be clearly defined (e.g., as either present or absent) may be regarded as a feature. Our system looks for two types of features in the text of a document: single words and pairs of adjacent single words. If a feature appears at least once in a document, it is counted as "present." Otherwise, it is counted as "absent." The internal representation of a document is therefore a binary vector, each element indicating the presence or absence of the corresponding feature in the document. The system removes many common suffixes from a word using Paice's stemming rules (Paice, 1977). This means, for example, that if either of the words "walking" or "walked" appears in a document, both are reduced to the same stem, "walk."
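The feature extraction step can be sketched as follows; this is an illustrative fragment, not the system's code. The stem function below is a deliberately crude suffix-stripper standing in for Paice's rules, and extract_features and vocabulary are hypothetical names.

```python
import re

# Toy suffix list standing in for Paice's stemming rules (Paice, 1977),
# which the actual system uses.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip one common suffix; a crude stand-in, not Paice's algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def extract_features(text, vocabulary):
    """Return the binary document vector described above: one element per
    known feature (single stemmed words and pairs of adjacent stemmed
    words), set to 1 if the feature occurs at least once, else 0."""
    words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    present = set(words)                                       # single words
    present |= {f"{a} {b}" for a, b in zip(words, words[1:])}  # adjacent pairs
    return [1 if feature in present else 0 for feature in vocabulary]

# Example: "walking" and "walked" both map to the feature "walk".
vocab = ["walk", "retrieval", "text retrieval"]
print(extract_features("Walking through text retrieval systems.", vocab))
```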