NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval
P. Jacobs
National Institute of Standards and Technology
D. K. Harman

2.2.1 "Manual" queries for routing

In manual routing, our approach uses a statistical corpus analysis, developed originally for text categorization [4], to pull out terms based on their relative frequency in relevant documents for each topic. The statistic used combines the entropy-based mutual information statistic (testing the independence of each term with each topic) with a correction for low-frequency terms and for ambiguous words. Words with high weights have a high degree of association with a topic. This statistical analysis is also used in ranking. The base weighting formula is the following:

C(log2 b)(log2 r)

where C is a constant, b is the number of times a term appears in a story assigned to a particular category, for example, and log2 r is the log of the ratio of combined probabilities (i.e., of a particular word or phrase occurring in a text about a particular category) to the product of independent probabilities, i.e., the mutual information statistic. This tests the assumption that the use of the word and the category of the text are independent. When this assumption is false, the word gets a high positive or negative weight.
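The weighting above can be sketched as follows. This is an illustrative reconstruction, not GE's implementation: the constant C, the counts, and the handling of degenerate cases (b <= 1 or r <= 0) are assumptions; the paper's low-frequency and ambiguity corrections are omitted.

```python
import math

def term_weight(n_term_topic, n_term, n_topic, n_total, C=100.0):
    """Mutual-information-based association weight of a term with a topic.

    weight = C * log2(b) * log2(r), where b is the term's frequency in
    texts assigned to the topic and r is the ratio of the joint
    probability P(term, topic) to P(term) * P(topic).
    """
    b = n_term_topic                     # occurrences in the topic's texts
    p_joint = n_term_topic / n_total
    p_term = n_term / n_total
    p_topic = n_topic / n_total
    r = p_joint / (p_term * p_topic)     # independence ratio
    if b <= 1 or r <= 0:                 # assumed guard for degenerate cases
        return 0.0
    return C * math.log2(b) * math.log2(r)

# A term concentrated in one topic's documents gets a large weight;
# a term distributed independently of the topic gets a weight near zero.
w = term_weight(n_term_topic=50, n_term=60, n_topic=1000, n_total=100000)
print(w > 0)  # → True
```

When the term and the topic are independent, r is near 1, so log2(r) is near 0 and the weight vanishes; when the term avoids the topic, r < 1 and the weight goes negative, matching the "high positive or negative weight" behavior described above.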
For example, the following are the top words for Topic 51, "Airbus subsidies":

A-330                        TOPIC51  1263.3
AIRBUS                       TOPIC51  1183.1
A-340                        TOPIC51  1178.2
INDUSTRIE                    TOPIC51  1071.9
A-320                        TOPIC51  1067.5
MESSERSCHMITT-BOELKOW-BLOHM  TOPIC51   851.3
AERONAUTICAS                 TOPIC51   843.3
CONSTRUCCIONES               TOPIC51   807.9
AEROSPATIALE                 TOPIC51   762.5
MBB                          TOPIC51   722.6
WIDE-BODY                    TOPIC51   617.7
MD-11                        TOPIC51   613.8
TOULOUSE                     TOPIC51   228.5
JETLINERS                    TOPIC51   217.8
LUFTHANSA                    TOPIC51   196.6
MD-80                        TOPIC51   196.6

Clearly, these words all have some reason to be associated with this topic, but adding them to the appropriate group in each query (or ignoring them entirely) is a "manual" process. Our manual routing queries, therefore, are a combination of the regular expressions that were developed from the topic descriptions with terms added that were selected from the automatic training. This is, we believe, a very practical manual approach that has very good performance.

2.2.2 "Hard" vs. "Soft" Booleans

The Boolean matcher uses a "hard" Boolean approach, in that it will admit, for each query, only texts that satisfy the conditions of that query. For example, in Topic 51 above, "Airbus subsidies", the matcher will allow only texts that have both an Airbus term and a subsidy term in the same paragraph. However, this is a narrow topic, and TREC-2 allows each system to produce 1000 texts for each topic. The evaluation metrics offer no penalty for filling up the list of 1000 with texts that are likely to be irrelevant. So, in order to provide increased flexibility and consider larger numbers of texts for each topic, we used an additional engine only for the purpose of pulling in texts for very specific queries like this one. The system is still a hard Boolean system in that texts that satisfy the Boolean conditions will always be ranked higher than texts that do not satisfy the conditions; however, texts that do not satisfy the conditions can appear on the final ranked lists.
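The hard-Boolean-first ranking described above can be sketched as follows. The term groups, patterns, and scoring here are illustrative assumptions, not GE's actual query for Topic 51: a text passes the hard Boolean only if some paragraph contains both an Airbus term and a subsidy term, and failing texts are still ranked, but always below every passing text.

```python
import re

# Hypothetical term groups for Topic 51 (illustrative, not GE's query).
AIRBUS_TERMS = re.compile(r"\b(airbus|a-3[234]0|aerospatiale|mbb)\b", re.I)
SUBSIDY_TERMS = re.compile(r"\b(subsid\w*|loan\w*)\b", re.I)

def hard_match(text):
    """Hard Boolean: both groups must co-occur in one paragraph."""
    return any(AIRBUS_TERMS.search(p) and SUBSIDY_TERMS.search(p)
               for p in text.split("\n\n"))

def soft_score(text):
    """Soft relaxation: count matching terms anywhere in the text."""
    return len(AIRBUS_TERMS.findall(text)) + len(SUBSIDY_TERMS.findall(text))

def rank(texts):
    # Hard matches sort first; the soft score orders the remainder,
    # so failing texts can still fill out the ranked list.
    return sorted(texts, key=lambda t: (hard_match(t), soft_score(t)),
                  reverse=True)

docs = [
    "Airbus won new orders from Lufthansa.",
    "Airbus Industrie received subsidies from European governments.",
    "The weather in Toulouse was mild.",
]
print(rank(docs)[0])  # → "Airbus Industrie received subsidies from European governments."
```

The tuple sort key is the design point: the hard Boolean acts as a strict partition of the ranked list, while the soft score only reorders texts within each partition, which matches the guarantee that hard matches always outrank non-matches.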
The "soft" engine considers such texts by relaxing some of the Boolean conditions, effectively pulling in texts that have a large number of terms that match the query but do not necessarily meet all the conditions. This component of the system is more like statistical retrieval engines; however, it does not have a large impact on the overall scores, because it only affects the results at the low-precision extreme (the lowest rankings) for queries that match very few documents. In fact, for Topic 51, the hard Boolean query matches 11 texts, and the soft method pulls in an additional 989 texts. But there are only 11 texts that are judged relevant, of which 10 satisfy the hard Boolean. So enforcing the hard Boolean condition seems to work well for this topic, and the soft Boolean doesn't contribute much.

For some topics, the balance between the hard Boolean condition and the soft Boolean isn't so clear; i.e., not enforcing the hard condition would lead to better ranking. This seems to be a function of how well the topics fit with Boolean expressions in general. "Airbus subsidies" is really a Boolean topic, in that a relevant text must say something about Airbus and something about subsidies. Other topics like "automation" are much harder to express in a Boolean form. We will cover these issues in topic-by-topic performance later in this paper.

2.2.3 "Automatic" queries for ad hoc retrieval

The "manual" query method is partly automated in the sense that the corpus-based statistical training suggests many of the terms that are used in the queries. But it is "manual" in that the initial formulation of the queries is done manually from the TREC topic descriptions. We have tried, as a simple experiment, to generate the Boolean term groupings and expand each term automatically from the topic descriptions. In the days before the TREC-2 ad hoc test, we tried several different ways of do-