The Second Text REtrieval Conference (TREC-2)
NIST Special Publication 500-215 (D. K. Harman, ed., National Institute of Standards and Technology)

Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)

R. Tong and L. Appelbaum

    At 200 docs:   0.8900
    At 500 docs:   0.6220
    At 1000 docs:  0.3120
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact: 0.8493

Thus, although recall decreased slightly (we now retrieve 312 rather than 328 of the 345 relevant documents), precision improved by nearly 50 percentage points. This is obviously a significant improvement, and it was achieved with minimal manual input; the time required to make these changes was only on the order of 10 minutes.

We repeated this exercise with Topic 54, for which the initial model-2 tree is:

Topic_54 <Or>
*  0.93 Topic_Path_54_2 <Accrue>
** 0.75 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 "ATLAS"
** 0.75 "COMMERCIAL"
** 0.75 "CONTRACT"
** 0.75 "DELTA"
** 0.75 "DOCUMENT"
** 0.75 "DOUGLAS"
** 0.75 "DYNAMICS"
** 0.75 "INDUSTRY"
** 0.75 "LAUNCH"
** 0.75 "MARTIN"
** 0.75 "MENTION"
** 0.75 "PAYLOAD"
** 0.75 "PRELIMINARY"
** 0.75 "RELEVANT"
** 0.75 "RESERVATION"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SERVICES"
** 0.75 "SPACE"
** 0.75 "TENTATIVE"
** 0.50 "TITAN"

Using the same kinds of procedures (i.e., removing extraneous words and combining words into phrases), we constructed the following modified tree:

Topic_54 <Or>
*  0.93 Topic_Path_54_2 <Accrue>
** 0.10 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 Atlas_Rocket <Sentence>
***     "ATLAS"
***     "ROCKET"
** 0.75 Commercial_Satellite <Sentence>
***     "COMMERCIAL"
***     "SATELLITE"
** 0.75 "DELTA II"
** 0.75 "MCDONNELL DOUGLAS"
** 0.75 "GENERAL DYNAMICS"
** 0.10 "LAUNCH"
** 0.75 "MARTIN MARIETTA"
** 0.75 "PAYLOAD"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SPACE"
** 0.50 "TITAN"
** 0.75 "CONTRACT"
** 0.75 "LAUNCH SERVICE"

Notice that here we actually added words that were part of obvious proper names (i.e., the "GENERAL" of "GENERAL DYNAMICS", the "MARIETTA" of "MARTIN MARIETTA", and the "MCDONNELL" of "MCDONNELL DOUGLAS"), but otherwise nothing was added. We also adjusted the weights on "AGREEMENT" and "LAUNCH" to de-emphasize their importance. The results of running this modified query are:

Queryid (Num):  54
Total number of documents over all queries
    Retrieved:  1000
    Relevant:     65
    Rel_ret:      65
Interpolated Recall - Precision Averages:
    at 0.00    0.5800
    at 0.10    0.5800
    at 0.20    0.5800
    at 0.30    0.5800
    at 0.40    0.5800
    at 0.50    0.4592
    at 0.60    0.4592
    at 0.70    0.4324
    at 0.80    0.3355
    at 0.90    0.1474
    at 1.00    0.0657
Average precision (non-interpolated) over all rel docs: 0.3889
Precision:
    At 5 docs:    0.2000
    At 10 docs:   0.1000
    At 15 docs:   0.3333
    At 20 docs:   0.4000
    At 30 docs:   0.4667
    At 100 docs:  0.4500
    At 200 docs:  0.2700
    At 500 docs:  0.1260
    At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact: 0.4615

Thus, again for very little manual input, we achieved a significant improvement in precision performance, and this time at no cost to recall.
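The measures quoted in these tables appear to be the standard TREC (trec_eval-style) ones. As a quick consistency check on the numbers: Topic 54 has R = 65 relevant documents, and the reported R-precision of 0.4615 corresponds to 30 of the first 65 retrieved documents being relevant (30/65 = 0.4615 to four places). A minimal sketch of the two measures, assuming ranking is the system's ranked list of document ids and relevant is the set of judged-relevant ids:

    def precision_at(ranking, relevant, k):
        """Fraction of the top-k retrieved documents that are relevant
        (the "At k docs" lines in the tables above)."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def r_precision(ranking, relevant):
        """Precision after exactly R = num_rel documents are retrieved
        (the "Exact" R-Precision line in the tables above)."""
        return precision_at(ranking, relevant, len(relevant))

    # For Topic 54: 65 relevant documents, of which 30 in the top 65
    # would give r_precision = 30 / 65 = 0.4615 (to four places).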
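For readers unfamiliar with the tree notation used above, the following is a minimal sketch of how such a weighted evidence tree might be scored against a document. The operator semantics (<Or> as a maximum over children, <Accrue> as a probabilistic sum, <Sentence> as within-sentence co-occurrence) and the toy Document class are assumptions made for illustration; this excerpt does not define how the system actually combines evidence.

    # Hypothetical scoring of a Topic_54-style tree. Operator semantics
    # are assumed, not taken from the paper: <Or> keeps the strongest
    # single line of evidence, <Accrue> accumulates independent evidence,
    # <Sentence> requires its terms to co-occur within one sentence.

    class Document:
        def __init__(self, text):
            # Naive sentence split on '.'; real systems do better.
            self.sentences = [s.upper().split() for s in text.split(".")]

        def contains(self, term):
            # Multi-word terms approximated as within-sentence co-occurrence.
            words = term.split()
            return any(all(w in s for w in words) for s in self.sentences)

        def in_one_sentence(self, terms):
            return any(all(t in s for t in terms) for s in self.sentences)

    def score(node, doc):
        """Evidence score in [0, 1] for `node` against `doc`."""
        op = node["op"]
        if op == "Term":
            return 1.0 if doc.contains(node["text"]) else 0.0
        if op == "Sentence":
            terms = [c["text"] for c in node["children"]]
            return 1.0 if doc.in_one_sentence(terms) else 0.0
        # Each child contributes weight * its own score, mirroring the
        # per-child weights written in the trees above.
        scores = [c["weight"] * score(c, doc) for c in node["children"]]
        if op == "Or":        # strongest single child wins
            return max(scores, default=0.0)
        if op == "Accrue":    # probabilistic sum: evidence accrues
            total = 0.0
            for s in scores:
                total = total + s - total * s
            return total
        raise ValueError("unknown operator: " + op)

    doc = Document("Martin Marietta won a contract to launch the satellite. "
                   "The Titan rocket will carry the payload.")
    tree = {"op": "Or", "children": [
        {"weight": 0.93, "op": "Accrue", "children": [
            {"weight": 0.75, "op": "Term", "text": "MARTIN MARIETTA"},
            {"weight": 0.50, "op": "Term", "text": "TITAN"},
            {"weight": 0.75, "op": "Sentence", "children": [
                {"op": "Term", "text": "COMMERCIAL"},
                {"op": "Term", "text": "SATELLITE"}]}]}]}
    print(score(tree, doc))  # 0.93 * (0.75 + 0.50 - 0.75*0.50) = 0.81375

Under these assumed semantics, the weight changes described in the text have the intended effect: dropping "AGREEMENT" and "LAUNCH" to 0.10 sharply limits how much those common words can contribute, while <Sentence> nodes such as Atlas_Rocket only fire when the component words appear together, which is why the modified tree trades little recall for a large gain in precision.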