5.3 Routing Results

The routing evaluation used a subset of the training topics (topics 51-100 were used) against the new disk of test documents (disk 3). There were 40 sets of results for routing evaluation, with 32 of them based on runs for the full data set. Of the 32 systems using the full data set, 23 used automatic construction of queries, and 9 used manual construction.

Figure 9 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of the routing queries. Again three systems are based on the Cornell SMART system. The plot marked "crnlC1" is the actual SMART system, using the basic Rocchio relevance feedback algorithms and adding many terms (up to 500) from the relevant training documents to the terms in the topic. The "dortP1" results come from using probabilistically-based relevance feedback instead of the vector-space algorithm, and adding only 20 terms from the relevant documents to each query. These two systems have the best routing results. The "Brkly5" system uses logistic regression on both the general frequency variables used in their adhoc approach and on the query-specific relevance data available for training with the routing topics. The results marked "cityr2" are from City University, London (see Robertson, Walker, Jones, Hancock-Beaulieu & Gatford paper). This group automatically selected variable numbers of terms (1-25) from the training documents for each topic (the topics themselves were not used as term sources), and then used traditional probabilistic reweighting to weight these terms. The "INQ003" results also use probabilistic reweighting, but use the topic terms, expanded by 30 new terms per topic from the training documents. The results marked "lsir2" are more latent semantic indexing results from Bellcore. This run was made by creating a filter of the singular-value decomposition vector sum or centroid of all relevant documents for a topic (and ignoring the topic itself).

Figure 10 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of the routing queries. The results marked "INQ001" are from the INQUERY system using an inferential combination of the "INQ003" queries and manually modified queries created from the topic. The "trw2" results represent an adaptation of the TRW Fast Data Finder pattern matching system to allow use of term weighting (see Mettler paper). The queries were manually constructed and the term weighting was learned from the training data. The "gecrd1" results from the GE Research and Development Center (see Jacobs paper) also come from manually constructed queries, but using a general-purpose lexicon and the training data to suggest input to the Boolean pattern matcher. The results marked "CLARTN" are similar to the "CLARTM" adhoc results except that the training documents were used as the source for thesaurus building, as opposed to using the top set of retrieved documents. The "rutcombx" results from Rutgers University (see Belkin, Kantor, Cool & Quatrain paper) come from combining 5 sets of manually generated Boolean queries to optimize performance for each topic.
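To make the query-expansion approaches described above more concrete, the following is a minimal sketch of Rocchio-style relevance feedback of the kind used in the SMART "crnlC1" run: term weights from known relevant training documents are merged into the topic's term vector, and only the most highly weighted expansion terms (up to some cap, e.g. 500) are kept. The function name, parameters, and weighting constants here are illustrative assumptions, not the actual SMART implementation.

    from collections import defaultdict

    def rocchio_expand(query_weights, relevant_doc_vectors,
                       alpha=1.0, beta=0.75, max_new_terms=500):
        """Illustrative Rocchio expansion: the original query weights plus a
        scaled centroid of the relevant training documents, keeping only the
        top-weighted new terms.  Constants and the cap are assumptions."""
        centroid = defaultdict(float)
        for vec in relevant_doc_vectors:          # each vec is {term: weight}
            for term, w in vec.items():
                centroid[term] += w / len(relevant_doc_vectors)

        # original topic terms, reinforced by relevant-document evidence
        expanded = {t: alpha * w + beta * centroid.get(t, 0.0)
                    for t, w in query_weights.items()}

        # candidate expansion terms, ranked by centroid weight, capped
        candidates = sorted(
            (t for t in centroid if t not in query_weights),
            key=lambda t: centroid[t], reverse=True)[:max_new_terms]
        for t in candidates:
            expanded[t] = beta * centroid[t]
        return expanded

A routing query built this way is then matched against each incoming document (for example by an inner product of term weights), so the number of expansion terms retained directly affects both effectiveness and matching cost.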
The results marked "TOPIC2" are from the TOPIC system and reflect the use of an expert system working off specially-constructed knowledge bases to improve performance.

As was the case with the adhoc topics, the automatic query construction methods continue to perform as well as, or in this case better than, the manual construction methods. A comparison of the two INQUERY runs illustrates this point and shows that all six results with manually generated queries perform worse than the six runs with automatically generated queries. The availability of the training data allows an automatic tuning of the queries that would be difficult to duplicate manually without extensive analysis. Unlike the adhoc results, there are two runs ("crnlC1" and "dortP1") that are clearly better than the others, with a significant difference between the "crnlC1" results and the "dortP1" results, and also significant differences between these results and the rest of the automatically-generated query results. In particular, the Cornell group's ability to effectively use many terms (up to 500) for query expansion was one of the most interesting findings in TREC-2 and represents a departure from past results (see Buckley, Allan & Salton paper for more on this).

As a final point, it should be noted that the routing results also represent significant improvements over the results from TREC-1. Figure 11 shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to the improved evaluation techniques, but the difference between the curve marked "TREC-1" and the curve marked "TREC-2 looking at top 200 only" shows significant performance improvement. There is even more improvement for the routing results than for the adhoc results, due to better training data (mostly non-existent for TREC-1) and to major efforts by many groups in new routing algorithm experiments.

Only four groups worked with less than the full document collection. Figure 12 shows the results for two of the groups in category B compared with a category B version of the Cornell SMART results. These curves show the results of runs from New York University (which were done in a similar manner to that used for the adhoc results) and results from Dalhousie University.
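For reference, the non-interpolated average precision used above to rank the runs can be computed as follows. This is the standard formulation (precision taken at the rank of each relevant document retrieved, averaged over all relevant documents for the topic); the sketch uses hypothetical variable names rather than the exact trec_eval code.

    def average_precision(ranked_doc_ids, relevant_doc_ids):
        """Non-interpolated average precision for a single topic: precision is
        taken at the rank of each relevant document retrieved, and the sum is
        divided by the total number of relevant documents, so relevant
        documents that are never retrieved contribute a precision of zero."""
        relevant = set(relevant_doc_ids)
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0

A run's single summary number is then the mean of this value over all topics in the evaluation, which is the figure used to select the systems shown in Figures 9 and 10.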