NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval

P. Jacobs

National Institute of Standards and Technology, D. K. Harman

[…] this automatic Boolean query generation, and chose the one that worked best on the sample data. Our first attempt was to use the common methods for finding collocations and word associations in sentences, and these worked horribly for term expansion. The problem is that this approach finds more associations like "funeral" and "home" than it does "hostage" and "captive", and the latter, text-level associations are what is required to generate good queries. The "solution" we tried was, for a sample of about 10 million words in the corpus, to choose the top 20 words based on TF.IDF weights for each document, store the frequency of association among these terms, and then weight each pair using the weighted mutual information statistic of the previous section. This was much better than using sentence-level information, although it is still a very straightforward approach. For example, the following are the top 10 terms associated with the word "hostage" (in order): hostages, Lebanon, Beirut, Iran, release, Terry, kidnappers, kidnapped, Jihad, Anderson. While this is certainly not the optimal set of terms to use in place of "hostage", it is a good start.

The next problem in automatic query construction is deciding when to use a combination of terms and when to use a single term. For example, the term "weather-related fatalities" is a combination of two word groups (weather and fatalities), while "Iran-contra affair" is really only one group (Iran-contra), even though it might appear that "affair" is a significant term. Again we took the direct approach, choosing to combine terms whenever there was a reasonable percentage of overlap between their associated terms.
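The association step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact form of the weighted mutual information statistic is defined in an earlier section not reproduced here, so a common weighted-MI form, p(x,y) log(p(x,y)/(p(x)p(y))), is used as a placeholder, with document-level co-occurrence probabilities estimated from the top-20 TF.IDF terms of each document.

```python
import math
from collections import Counter
from itertools import combinations

def top_terms_per_doc(docs_tfidf, k=20):
    """docs_tfidf: one {term: tfidf_weight} dict per document.
    Returns the k highest-weighted terms of each document."""
    return [sorted(d, key=d.get, reverse=True)[:k] for d in docs_tfidf]

def association_scores(doc_top_terms):
    """Count how often two top terms share a document, then score each
    pair with a weighted mutual-information statistic (placeholder form:
    p(x,y) * log(p(x,y) / (p(x) * p(y))), probabilities per document)."""
    term_freq = Counter()
    pair_freq = Counter()
    n_docs = len(doc_top_terms)
    for terms in doc_top_terms:
        unique = set(terms)
        term_freq.update(unique)
        pair_freq.update(frozenset(p) for p in combinations(sorted(unique), 2))
    scores = {}
    for pair, f_xy in pair_freq.items():
        x, y = tuple(pair)
        p_xy = f_xy / n_docs
        p_x = term_freq[x] / n_docs
        p_y = term_freq[y] / n_docs
        scores[pair] = p_xy * math.log(p_xy / (p_x * p_y))
    return scores
```

With this scoring, the expansion set for a word such as "hostage" would simply be its highest-scoring partners, analogous to the top-10 list shown above.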
This worked surprisingly well in cases where the topic title was a good description (e.g. "welfare reform") and very badly for those with vague titles (e.g. "find innovative companies"). We tried to recover from these by including more words from the description and narrative, but then we had to start recognizing the language of these descriptions, filtering out words like "relevant", "mention" and so forth. At this crude stage, the main problem with the query generation method is in using the structure of the topic descriptions.

The second major issue with automatic query generation is that it isn't nearly as good at finding good terms as the process of training from data and relevance judgements, as used in the routing experiments. The relevance judgements used for routing contain large volumes of relatively high-accuracy data, while the training used for term expansion in query generation relied on relatively small volumes of relatively noisy data. For example, the word "welfare" used in one of the ad hoc topics occurred with a high enough TF.IDF weight only 29 times in the training sample, and the most frequently associated term, "children", occurred only 6 times. In order to establish good associations between "welfare" and less frequent terms, we would need much more data. The data from TREC-2 seem to suggest that low-frequency terms contribute more in term expansion than high-frequency terms, so using a "small" training sample (10 million words is only about 3% of the corpus) was a major error. We made many other mistakes in the training method, including mixing samples from the Federal Register and DOE sources with other texts that are much more likely to be relevant. This leaves a lot of room for future experiments and improvement. The fully automatic ad hoc system certainly didn't do as well as the manual routing system, but it was still at or above median for more than half of the ad hoc topics.
Considering that this method could be used within the context of almost any legacy retrieval system, the result is worth noting. Furthermore, the generation of Boolean queries from natural language descriptions is an interesting, as well as practical, research problem, because many different retrieval systems can make some use of Boolean queries.

3 Ranking

In both routing and ad hoc, we used a set of word weights for ranking, acquired from the relevance judgements in the routing case and from the corpus data in the ad hoc case. In routing, the weights reflect the statistical measure of association between the term and each topic (using the weighted mutual information score given earlier). In the ad hoc case, the weight is a function of the frequency of the term in the topic description, the inverse collection frequency, and an additional factor that weights certain components of the topic descriptions (such as the title and description) more heavily than others. We combined the weighted frequency of these terms with an overall count of the number of topic hits per document, normalizing for document length, to produce a score for each document. This was the result of trying many different approaches on the test data, so it was definitely a good method for our system.

However, in comparing our results with those of other systems, our precision curve across various recall points is not nearly as good as that of a system that does really good ranking. In routing, we are not sure that ranking is important, but it is certainly important in getting good results in TREC. So, we are inclined to try to combine our