3.4.1.3 TOPIC 67

Our analysis located weak topic formulation examples, such as query 67, illustrated in Figure 4. In this query, a set of optional, auxiliary evidence was "ANDed" with a small set of required evidence. The weight, or strength, assigned to the auxiliary evidence was .05, which means that if all auxiliary terms were located, the highest possible score for a document would be .05, severely limiting the range of scores and thus allowing random false hits into the top 1000. To make a cosmetic improvement, only the value of the auxiliary evidence node was changed, to a value of .5, as shown in Figure 5. This change alone brought the Topic's relevant document count to the median.

3.4.2 AD HOC TOPICS

Overall, Verity's performance on the ad hoc topics was adequate. Performance was poorer than on the routing topics, but this is to be expected since there was less time available to build the Topics and no ground truth against which to test the Topic trees. The comparison to the median is summarized in Figure 3. We count that 13 of the 50 results are at or above the median. In contrast, though, there were only two outright failures here, topics 124 and 139. We did not look at topic 139, but topic 124 involves searching for documents that discuss innovative approaches to cancer therapy that do not involve any of the traditional treatments. This is a very hard topic because nearly all mentions of the innovative treatments occur in the context of discussions of traditional therapies. The approach adopted by Verity, of simply looking for documents that talk about innovative treatments, produces a large number of false hits (giving poor precision), and since there is an artificial cut-off at 1000 documents in the TREC experiments, this model also produces poor recall. We do not see an obvious solution to this. We picked three ad hoc topics to analyze in detail.

3.4.2.1 AD HOC TOPIC 109

A relevant document for this topic simply needs to mention one of a list of six companies given in the information need statement. A simple Topic that is the disjunction (OR) of the company names should be all that is needed here. However, the official result is:

Relevant = 742    Rel_ret = 192    R-Precision = 0.2588

which is well below the median. Furthermore, given the simplicity of the topic, this is surprisingly low recall. Examination of the official Topic showed that the company acronyms we used for three of the companies (i.e., 3M, OTC, ISI) were given equal weight to the fully spelled-out company names. A cursory review of the original hit list showed that ISI was a poor choice since it has multiple interpretations. Less important, but for the same reason, OTC is a poor choice in the Wall Street Journal corpus since it can mean "over the counter", and in the DOE corpus 3M is part of a designator for a particular particle accelerator and is also used as an abbreviation for "three meters". We modified the Topic by eliminating the ISI acronym and by giving OTC and 3M reduced weights. This produced the following:

Relevant = 742    Rel_ret = 480    R-Precision = 0.5512

which would have been the best score. An interesting note here is that both the original and modified Topics had perfect precision and recall for the first 100 documents. Our conclusion is that this was indeed an easy topic; the false hits produced by ISI were what lowered the Topic's score.
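As a concrete illustration of the score-capping effect described for Topic 67 (and of the reduced acronym weights used here), the sketch below shows one way a weighted auxiliary evidence node could be scored. The function topic_score and its accrual rule are our own illustrative assumptions, not Verity's actual TOPIC scoring algorithm.

    # Illustrative sketch only: this accrual rule is an assumption,
    # not Verity's actual TOPIC scoring algorithm.
    def topic_score(doc_terms, required_terms, auxiliary_terms, aux_weight):
        """Required evidence acts as a gate; the auxiliary node contributes
        at most aux_weight, so aux_weight caps the attainable score."""
        if not all(term in doc_terms for term in required_terms):
            return 0.0
        matched = sum(1 for term in auxiliary_terms if term in doc_terms)
        return aux_weight * matched / len(auxiliary_terms)

    doc = {"req_a", "req_b", "aux_1", "aux_2"}          # hypothetical terms
    required = {"req_a", "req_b"}
    auxiliary = {"aux_1", "aux_2"}
    print(topic_score(doc, required, auxiliary, 0.05))  # 0.05: the original cap
    print(topic_score(doc, required, auxiliary, 0.5))   # 0.5: the widened range

Under a weight of .05, every document that satisfies the required evidence scores in the narrow interval [0, .05], which is the compressed score range discussed above; raising the node's weight to .5 widens that range.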
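The Rel_ret and R-Precision figures quoted in this section follow the standard TREC definitions. A minimal sketch of how they are computed from a ranked result list is shown below; this is not the official evaluation code, and the function names are our own.

    # Minimal sketch of the reported measures; not the official TREC evaluation code.
    def rel_ret(ranked_ids, relevant_ids, cutoff=1000):
        """Number of relevant documents among the (at most 1000) retrieved."""
        return sum(1 for doc_id in ranked_ids[:cutoff] if doc_id in relevant_ids)

    def r_precision(ranked_ids, relevant_ids):
        """Precision at rank R, where R is the number of relevant documents."""
        r = len(relevant_ids)
        if r == 0:
            return 0.0
        return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids) / r

For Topic 109, for example, Relevant = 742, so R-Precision is the fraction of relevant documents among the first 742 ranked documents.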
3.4.2.2 AD HOC TOPIC 121

A relevant document for this topic had to mention the death of a prominent U.S. citizen due to an identified form of cancer. This is an interesting topic consisting of two major components: the idea of a prominent citizen, and the idea of a specific cancer. In the official Topic, prominence was modeled using a number of words that indicate prominence (e.g., "prominent", "celebrity") together with words that indicate prominent roles (e.g., "Nobel Prize", "actor", "actress"). Cancer death was modeled by various combinations of death words (e.g., "death", "died") and cancer words (e.g., "cancer", "tumor", "leukemia"). The official score was:

Relevant = 55    Rel_ret = 27    R-Precision = 0.1455

which, while not good in absolute terms, was well above the median. We observed two problems with this definition. First, it uses generic cancer terms rather than the specific cancer types required by the information need statement. So we made all the cancer terms specific by using a list of common cancers (e.g., lung cancer, breast cancer, stomach cancer, etc.). We made no attempt to make