NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum
Edited by D. K. Harman, National Institute of Standards and Technology
At 200 docs: 0.8900
At 500 docs: 0.6220
At 1000 docs: 0.3120
R-Precision (precision after R (= num_rel for a query) docs retrieved):
Exact: 0.8493
Thus, although recall decreased slightly (we now retrieve
312 rather than 328 of the 345 relevant documents), precision
improved by nearly 50 percentage points. This is obviously a
significant improvement, and it was achieved with minimal
manual input. The time required to make these changes was
only of the order of 10 minutes.
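To make the recall comparison concrete, the counts quoted above
give the following (a minimal Python sketch; the variable names
are ours):

    # Recall = relevant retrieved / total relevant, using the counts above.
    num_rel = 345          # relevant documents for the topic
    rel_ret_before = 328   # retrieved by the initial model-2 tree
    rel_ret_after = 312    # retrieved by the modified tree

    recall_before = rel_ret_before / num_rel   # ~0.951
    recall_after = rel_ret_after / num_rel     # ~0.904
    print(f"recall: {recall_before:.3f} -> {recall_after:.3f}")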
We repeated this exercise with Topic 54, whose initial
model-2 tree is:
Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.75 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 "ATLAS"
** 0.75 "COMMERCIAL"
** 0.75 "CONTRACT"
** 0.75 "DELTA"
** 0.75 "DOCUMENT"
** 0.75 "DOUGLAS"
** 0.75 "DYNAMICS"
** 0.75 "INDUSTRY"
** 0.75 "LAUNCH"
** 0.75 "MARTIN"
** 0.75 "MENTION"
** 0.75 "PAYLOAD"
** 0.75 "PRELIMINARY"
** 0.75 "RELEVANT"
** 0.75 "RESERVATION"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SERVICES"
** 0.75 "SPACE"
** 0.75 "TENTATIVE"
** 0.50 "TITAN11
Using the same kinds of procedures (i.e., removing
extraneous words and combining words into phrases) we
constructed the following modified tree:
Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.10 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 Atlas_Rocket <Sentence>
"ATLAS"
"ROCKET"
** 0.75 Commercial_Satellite <Sentence>
"COMMERCIAL"
"SATELLITE"
** 0.75 "DELTA 1111
** 0.75 "MCDONNELL DOUGLAS"
** 0.75 11GENERAL DYNAMICS"
** 0.10 "LAUNCH"
** 0.75 "MARTIN MARIETTA"
** 0.75 "PAYLOAD"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SPACE"
** 0.50 "TITAN"
** 0.75 "CONTRACT"
** 0.75 "LAUNCH SERVICE"
Notice that here we actually added words that were part of
obvious proper names (i.e., the "GENERAL" of "GENERAL
DYNAMICS", the "MARIETTA" of "MARTIN MARIETTA",
and the "MCDONNELL" of "MCDONNELL DOUGLAS"),
but otherwise nothing was added. We also adjusted the
weights on "AGREEMENT" and "LAUNCH" to de-emphasize
their importance.
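The excerpt names the operators <Or>, <Accrue>, and <Sentence>
but does not spell out their scoring functions, so the following
Python sketch is only an illustration of how such a weighted
concept tree might be evaluated, assuming max semantics for <Or>,
a probabilistic sum for <Accrue>, and binary within-sentence
co-occurrence for <Sentence>; all class, function, and variable
names are ours:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        op: str                 # "Or", "Accrue", "Sentence", or "Term"
        term: str = ""          # used only when op == "Term"
        children: list = field(default_factory=list)   # list of (weight, Node)

    def score(node, sentences):
        # sentences: a document as a list of sentences, each a list of tokens
        if node.op == "Term":
            return 1.0 if any(node.term in s for s in sentences) else 0.0
        if node.op == "Sentence":
            # all child terms must co-occur within a single sentence
            terms = [child.term for _, child in node.children]
            return 1.0 if any(all(t in s for t in terms) for s in sentences) else 0.0
        child_scores = [w * score(child, sentences) for w, child in node.children]
        if node.op == "Or":
            return max(child_scores, default=0.0)
        if node.op == "Accrue":
            # probabilistic sum: evidence accrues but never exceeds 1.0
            total = 0.0
            for s in child_scores:
                total = total + s - total * s
            return total
        raise ValueError("unknown operator: " + node.op)

    # A fragment of the modified Topic_54 tree:
    atlas_rocket = Node("Sentence", children=[(1.0, Node("Term", term="ATLAS")),
                                              (1.0, Node("Term", term="ROCKET"))])
    topic_54 = Node("Or", children=[
        (0.93, Node("Accrue", children=[(0.10, Node("Term", term="AGREEMENT")),
                                        (0.75, Node("Term", term="ARIANE")),
                                        (0.75, atlas_rocket)]))])

    doc = [["THE", "ATLAS", "ROCKET", "CONTRACT"], ["ARIANE", "LAUNCH"]]
    print(round(score(topic_54, doc), 4))   # 0.8719

Under these assumed semantics, reducing a term's weight (as we
did for "AGREEMENT" and "LAUNCH") shrinks its contribution to
the accrued score without removing it entirely.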
The results of running this modified query are:
Queryid (Num): 54
Total number of documents over all queries
Retrieved: 1000
Relevant: 65
Rel_ret: 65
Interpolated Recall - Precision Averages:
at 0.00 0.5800
at 0.10 0.5800
at 0.20 0.5800
at 0.30 0.5800
at 0.40 0.5800
at 0.50 0.4592
at 0.60 0.4592
at 0.70 0.4324
at 0.80 0.3355
at 0.90 0.1474
at 1.00 0.0657
Average precision (non-interpolated) over
all rel docs:
0.3889
Precision:
At 5 docs: 0.2000
At 10 docs: 0.1000
At 15 docs: 0.3333
At 20 docs: 0.4000
At 30 docs: 0.4667
At 100 docs: 0.4500
At 200 docs: 0.2700
At 500 docs: 0.1260
At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
Exact: 0.4615
Thus, again for very little manual input, we achieved a
significant improvement in precision performance, and this
time at no cost in recall.
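For reference, the R-Precision figure reported above is simply
precision at rank R, where R is the number of relevant documents
for the query. A minimal sketch (the function name and arguments
are ours):

    def r_precision(ranked_ids, relevant_ids):
        """Precision after R documents retrieved, R = len(relevant_ids)."""
        r = len(relevant_ids)
        if r == 0:
            return 0.0
        return sum(1 for d in ranked_ids[:r] if d in relevant_ids) / r

    # Toy example: 3 relevant documents, 2 of which rank in the top 3.
    print(r_precision(["d1", "d7", "d3", "d9"], {"d1", "d3", "d5"}))  # 0.6667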