… from Cornell University (see Buckley, Allan & Salton paper), but using less than optimal term weightings (by mistake). The "dortQ2" results from the University of Dortmund come from using polynomial regression on the training data to find weights for various pre-set term features (see Fuhr, Pfeifer, Bremkamp, Pollmann & Buckley paper). The "Brkly3" results from the University of California at Berkeley come from performing logistic regression analysis to learn optimal weightings for various term frequency measures (see Cooper, Chen & Gey paper). The "CLARTA" system from the CLARIT Corporation expands each topic with noun phrases found in a thesaurus that is automatically generated for each topic (see Evans & Lefferts paper). The "lsiasm" results are from Bellcore (see Dumais paper). This group uses latent semantic indexing to create much larger vectors than the more traditional vector-space models such as SMART. The run marked "lsiasm" represents only the base SMART pre-processing results, however. Due to processing errors the "improved" LSI run produced unexpectedly poor results.

Figure 6 shows the recall/precision curve for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of queries. It should be noted that varying amounts of manual intervention were used. The results marked "INQ002", "siems2", and "CLARTM" are automatically generated queries with manual modifications. The "INQ002" results reflect various manual modifications made to the "INQ001" queries, with those modifications guided by strict rules. The "siems2" results from Siemens Corporate Research, Inc. (see Voorhees paper) are based on the use of the Cornell SMART system, but with the topics manually modified (the "not" phrases removed). These results were meant to be the base run for improvements using WordNet, but the improvements did not materialize. The "CLARTM" results represent manual weighting of the query terms, as opposed to the automatic weighting of the terms that was used in "CLARTA."

The results marked "Vtcms2", "CnQst2", and "TOPIC2" are produced from queries constructed completely manually. The "Vtcms2" results are from Virginia Tech (see Fox & Shaw paper) and show the effects of combining the results from SMART vector-space queries with the results from manually-constructed soft Boolean P-Norm type queries. The "CnQst2" results, from ConQuest Software (see Nelson paper), use a very large general-purpose semantic net to aid in constructing better queries from the topics, along with sophisticated morphological analysis of the topics. The results marked "TOPIC2" are from the TOPIC system by Verity Corp. (see Lehman paper) and reflect the use of an expert system working off specially-constructed knowledge bases to improve performance.
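Non-interpolated average precision, the measure used to select and rank the runs discussed above, averages the precision observed at the rank position of each relevant document retrieved. A minimal sketch of the computation for a single topic follows (the function and data names are illustrative only, not taken from any TREC system):

    def average_precision(ranked_doc_ids, relevant_ids):
        """Non-interpolated average precision for one topic.

        Precision is measured at the rank of each relevant document
        retrieved; relevant documents never retrieved contribute zero.
        """
        relevant_ids = set(relevant_ids)
        if not relevant_ids:
            return 0.0
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant_ids)

    # Relevant documents retrieved at ranks 1 and 3:
    # (1/1 + 2/3) / 2 = 0.833
    print(average_precision(["d1", "d7", "d3", "d9"], {"d1", "d3"}))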
Several comments can be made with respect to these adhoc results. First, the better results (most of the automatic results and the three top manual results) are very similar, and it is unlikely that there is any statistically significant difference between them. There is clearly no "best" method, and the fact that these systems take very different approaches to retrieval, including different term weighting schemes, different query construction methods, and different similarity matching methods, implies that there is much more to be learned about effective retrieval techniques. As will be seen in section 6, whereas the averages for the systems may be similar, the systems do better on different topics and retrieve different subsets of the relevant documents.

A second point is that the automatic query construction methods continue to perform as well as the manual construction methods. Two groups (the INQUERY system and the CLARIT system) did explicit comparisons of manually-modified queries versus unmodified queries and concluded that manual modification provided no benefit. The three sets of results based on completely manually-generated queries performed even worse than the manually-modified queries. Note that this result is specific to the very rich TREC topics; it is not clear that it will hold for the short topics normally seen in other retrieval environments.

As a final point, it should be noted that these adhoc results represent significant improvements over the results from TREC-1. Figure 7 shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to improved evaluation, but the difference between the curve marked "TREC-1" and the curve marked "TREC-2 looking at top 200 only" shows significant performance improvement. Whereas this improvement could represent a difference in topics (the TREC-1 curve is for topics 51-100 and the TREC-2 curves are for topics 101-150), the TREC-2 topics are generally felt to be more difficult, and therefore this improvement is likely to be an understatement of the actual gains.

Only two groups worked with less than the full document collection. Figure 9 shows the results for the one group with official TREC-2 category B results (the results from UCLA were received after the deadline). This figure shows the best results from New York University (see Strzalkowski & Carballo paper), compared with a category B version of the Cornell SMART results. The "nyuir3" results reflect a very intensive use of natural language processing (NLP) techniques, including a parse of the documents to help locate syntactic phrases, context-sensitive expansion of the queries, and other NLP improvements on statistical techniques.
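The parser-based phrase indexing used in the NYU runs is too involved for a short example, but the general idea, indexing multi-word units rather than single terms, can be sketched with a deliberately naive adjacency heuristic (hypothetical names; a real NLP system derives phrases from syntactic structure, not simple adjacency):

    import re

    STOPWORDS = {"the", "of", "a", "an", "and", "or", "to", "in", "for", "on", "from"}

    def phrase_terms(text):
        """Emit adjacent content-word pairs as candidate phrase index terms.

        A toy stand-in for syntactic phrase extraction: a stopword
        between two content words breaks the candidate phrase.
        """
        terms = []
        prev = None
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in STOPWORDS:
                prev = None  # stopword breaks adjacency
                continue
            if prev is not None:
                terms.append(prev + " " + word)
            prev = word
        return terms

    # ['context sensitive', 'sensitive expansion', 'query terms']
    print(phrase_terms("context-sensitive expansion of the query terms"))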