"walks" is present in a document, the system considers the root word "walk" to be present in the document.

The system must be given a list of features for which to look. This target list was constructed in three steps.

First, a list of candidate features was generated from the descriptions of the 50 TREC-2 routing topics. The text of the descriptions was broken into individual words, from which the suffixes (if any) were removed. Duplicate words and common words (stop words) were removed. A number of two-word features (such as phrases and proper names) were identified by hand. This procedure created a list of about 1400 candidate features.

Second, the system extracted the relative frequency information for each of these features from the training documents. Then, for each topic (the 10 plus an additional topic representing "none of the above"), the system sorted the candidate features in descending order according to the F4 formula of Robertson and Sparck Jones (Robertson & Sparck Jones, 1976), which we used as a measure of the ability of a feature to characterize a topic.

Third, the top 30 features for each topic (and the top 60 features for "none of the above") were combined into a single list. After removing duplicates, this yielded the final list of 229 features.

We tried numerous feature selection strategies and settled on this one as the most satisfactory.

3.2 Query Representation and Construction

In preparation for the inference step, the 10 single-topic queries are combined into one multiple-topic query. As mentioned in Section 2.3, this aggregation can in the worst case require 2^10 different states in the multiple-topic query. In this section, we describe how to reduce the number of states required by considering the relationships between the topics. Once the states of the query have been structured, we must assign numerical values to the Bayesian model. We describe how to generate estimates of the prior probabilities of each of these states in Section 3.2.3, and in Section 3.2.4 how to define, for each feature and state, the conditional probability that the feature appears in a document given that the document is relevant to that state.

3.2.1 Pairwise Relationships Between Topics

There are six possible pairwise relationships between any two topics t1 and t2:

1. t1 and t2 are mutually exclusive if there is no document relevant to both topics.
2. t1 is a subset of t2 if all documents relevant to t1 are also relevant to t2.
3. t2 is a subset of t1 if all documents relevant to t2 are also relevant to t1.
4. t1 and t2 are equivalent if t1 and t2 are subsets of each other (they satisfy Relations 2 and 3).
5. t1 and t2 are dependent if knowing that a document is relevant to t1 gives some information about whether the document is relevant to t2.
6. t1 and t2 are independent if knowing that a document is relevant to t1 gives no information about whether the document is relevant to t2.

Each of Relations 1-5 is a type of dependence between topics. In a belief network, topics satisfying Relations 1-5 would be connected by an arc. To ensure that two topics satisfy only one of the relations, Relation 5 (dependence) is defined as any type of dependence other than those of Relations 1-4.
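To make the defining conditions of Relations 1-4 concrete, the following is a minimal sketch of how a topic pair could be classified from the sets of training documents judged relevant to each topic. It is our illustration rather than the system's actual code; the function name and data layout are assumptions, and Relations 5 and 6 are deferred to the dependence test described below.

```python
def pairwise_relation(rel_t1, rel_t2):
    """Classify a topic pair by Relations 1-4 from training-set judgements.

    rel_t1, rel_t2: sets of document ids judged relevant to topics t1 and t2.
    Returns a relation name, or None when only the dependence test
    (Relations 5 and 6) remains to be applied.
    """
    if not (rel_t1 & rel_t2):
        return "mutually exclusive"    # Relation 1: no document relevant to both
    if rel_t1 == rel_t2:
        return "equivalent"            # Relation 4: each is a subset of the other
    if rel_t1 <= rel_t2:
        return "t1 subset of t2"       # Relation 2
    if rel_t2 <= rel_t1:
        return "t2 subset of t1"       # Relation 3
    return None                        # fall through to Relations 5 and 6
```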
Relation 6, independence, is represented in a belief network by the absence of an arc between the topics. The distinction between dependence (Relation 5) and independence (Relation 6) is useful in calculating the probabilities of combinations of the topics, as described in Section 3.2.3.

Relations 1-4 can be identified by testing whether the defining conditions are satisfied in the training set. If the topics satisfy none of the first four relations, then a chi-square test can distinguish between Relations 5 and 6. If there are too few documents with relevance judgements for both topics to make reliable conclusions (our cutoff was 13 documents), the system makes an assessment, then prompts the user to verify it manually.

To fully explore the pairwise relationships between topics, we selected ten of the fifty TREC-2 routing topics to use in our models and tests. The topics are listed in Table 3.1.
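As a concrete illustration of the chi-square test mentioned above, one option is to test a 2x2 contingency table of joint relevance judgements. The sketch below is our own, not the system's implementation: the use of scipy, the table layout, and the significance threshold are assumptions, and the 13-judgement cutoff with the manual-verification fallback is folded in as described above.

```python
from scipy.stats import chi2_contingency

def dependence_relation(rel_t1, rel_t2, judged_both, alpha=0.05, min_judged=13):
    """Distinguish Relation 5 (dependent) from Relation 6 (independent).

    rel_t1, rel_t2: sets of documents judged relevant to topics t1 and t2.
    judged_both:    documents with relevance judgements for both topics.
    min_judged:     below this count the assessment is flagged for manual review.
    """
    if len(judged_both) < min_judged:
        return "too few judgements: verify assessment manually"
    # 2x2 contingency table over documents judged for both topics.
    # Assumes non-degenerate margins, since Relations 1-4 were ruled out first.
    both    = len(rel_t1 & rel_t2 & judged_both)    # relevant to t1 and t2
    t1_only = len((rel_t1 - rel_t2) & judged_both)  # relevant to t1 only
    t2_only = len((rel_t2 - rel_t1) & judged_both)  # relevant to t2 only
    neither = len(judged_both) - both - t1_only - t2_only
    _, p_value, _, _ = chi2_contingency([[both, t1_only], [t2_only, neither]])
    return "dependent (Relation 5)" if p_value < alpha else "independent (Relation 6)"
```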