NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum
Precision:
  At    5 docs: 0.0000
  At   10 docs: 0.0000
  At   15 docs: 0.0000
  At   20 docs: 0.0000
  At   30 docs: 0.0000
  At  100 docs: 0.0500
  At  200 docs: 0.0600
  At  500 docs: 0.1080
  At 1000 docs: 0.0640
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.0308
Thus although recall is very good, precision is completely
unsatisfactory.
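(For reference, the cutoff precision and R-Precision figures above follow directly from the ranked output and the relevance judgments. The following minimal Python sketch shows the computation on toy data of our own; it is not the actual TREC evaluation software.)

def precision_at_k(ranking, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranking[:k] if d in relevant) / k

def r_precision(ranking, relevant):
    # Precision after R documents retrieved, where R = num_rel
    # (the number of relevant documents for the query).
    return precision_at_k(ranking, relevant, len(relevant))

# Toy example: three relevant documents, a six-document ranking.
ranking = ["d4", "d9", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(r_precision(ranking, relevant))         # 0.333...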
Our conjecture is that the automatically constructed model-2 trees, while generally giving good recall, give poor precision because they contain many extraneous features, or features that should be combined. To illustrate this, we considered the model-2 tree for Topic 52 as a starting point for a manually constructed tree. The initial model-2 tree is:
Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.75 "AFRICA"
** 0.25 "AFRICAN"
** 0.75 "APARTHEID"
** 0.25 "ARMS"
** 0.25 "BAN"
** 0.25 "BLACK"
** 0.25 "COMPANY"
** 0.25 "COMPLIANCE"
** 0.25 "CONTRACTS"
** 0.25 "CORPORATE"
** 0.25 "DISCUSS"
** 0.25 "DOCUMENT"
** 0.50 "DOMINATION"
** 0.25 "GOVERNMENT"
** 0.25 "INTERNATIONAL"
** 0.25 "INVESTMENT"
** 0.25 "ORGANIZATION"
** 0.25 "PRESSURE"
** 0.75 "PRETORIA"
** 0.25 "REDUCTION"
** 0.25 "RESPONSE"
** 0.75 "SANCTIONS"
** 0.75 "SOUTH"
** 0.25 "TIES"
** 0.25 "TRADE"
** 0.25 "UNITED"
Obviously there are a number of features here that are basically "noise" (for example, the words "COMPANY" and "RESPONSE"), and other words are clearly elements of a larger phrase (for example, the words "SOUTH" and "AFRICA"). Notice that, in general, words with lower scores are always candidates for elimination.
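A minimal Python sketch of this elimination heuristic follows; the 0.50 cutoff and the prune helper are illustrative assumptions only, since the revision below was done by hand and also drops some words (for example "DOMINATION", at 0.50) that a threshold alone would keep.

# Learned feature weights, excerpted from the model-2 tree above.
features = {
    "AFRICA": 0.75, "AFRICAN": 0.25, "APARTHEID": 0.75, "ARMS": 0.25,
    "COMPANY": 0.25, "DOMINATION": 0.50, "PRETORIA": 0.75,
    "RESPONSE": 0.25, "SANCTIONS": 0.75, "SOUTH": 0.75, "TRADE": 0.25,
}

def prune(features, threshold=0.50):
    # Low-weight words are the first candidates for elimination.
    return {word: w for word, w in features.items() if w >= threshold}

print(sorted(prune(features)))
# ['AFRICA', 'APARTHEID', 'DOMINATION', 'PRETORIA', 'SANCTIONS', 'SOUTH']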
The result of this pruning exercise was the following revised definition for Topic 52:
Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.50 S_Africa <Accrue>
*** 0.50 "SOUTH AFRICA"
*** 0.50 "PRETORIA"
** 0.50 "SANCTIONS"
** 0.20 Topic_52_Support <Accrue>
*** 0.50 "APARTHEID"
*** 0.50 <Near>
**** "BAN"
**** "TRADE"
*** 0.50 <Near>
**** "BAN"
**** "INVESTMENT"
So although we have added no new features, we have combined "SOUTH" and "AFRICA" and used this together with "PRETORIA" to define a concept called S_Africa. We have also used "APARTHEID", and "BAN" with "TRADE" and "INVESTMENT", to define another concept called Topic_52_Support. Finally, we adjusted the weights to give more prominence to S_Africa than to Topic_52_Support.
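The excerpt does not spell out how the <Or>, <Accrue>, and <Near> operators combine evidence at retrieval time, so the following Python sketch assumes one plausible reading purely for illustration: <Or> takes the maximum weighted child score, <Accrue> accrues weighted evidence as a probabilistic sum, and <Near> tests whether all of its terms co-occur within a ten-token window.

import itertools

def evaluate(node, tokens):
    # Score one node of a concept tree against a tokenized document.
    kind, body = node
    if kind == "word":
        return 1.0 if body in tokens else 0.0
    if kind == "phrase":                     # adjacent tokens, in order
        n = len(body)
        return 1.0 if any(tokens[i:i + n] == body
                          for i in range(len(tokens) - n + 1)) else 0.0
    if kind == "near":                       # all terms within 10 tokens
        pos = [[i for i, t in enumerate(tokens) if t == w] for w in body]
        if any(not p for p in pos):
            return 0.0
        spans = (max(c) - min(c) for c in itertools.product(*pos))
        return 1.0 if min(spans) <= 10 else 0.0
    if kind == "or":                         # strongest weighted child
        return max(w * evaluate(child, tokens) for w, child in body)
    if kind == "accrue":                     # probabilistic sum of evidence
        score = 0.0
        for w, child in body:
            e = w * evaluate(child, tokens)
            score = score + e - score * e
        return score
    raise ValueError(kind)

# The revised Topic_52 definition, transcribed from above.
TOPIC_52 = ("or", [
    (0.86, ("accrue", [                      # Topic_Style_52_2
        (0.50, ("accrue", [                  # S_Africa
            (0.50, ("phrase", ["SOUTH", "AFRICA"])),
            (0.50, ("word", "PRETORIA")),
        ])),
        (0.50, ("word", "SANCTIONS")),
        (0.20, ("accrue", [                  # Topic_52_Support
            (0.50, ("word", "APARTHEID")),
            (0.50, ("near", ["BAN", "TRADE"])),
            (0.50, ("near", ["BAN", "INVESTMENT"])),
        ])),
    ])),
])

doc = "PRETORIA EASES SANCTIONS AS APARTHEID LAWS ARE REPEALED".split()
print(round(evaluate(TOPIC_52, doc), 2))     # 0.57 under these assumptions

Under these assumed semantics, a document with S_Africa evidence and "SANCTIONS" already scores highly even without the supporting concept, which matches the intent of weighting S_Africa above Topic_52_Support.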
The results for this modified topic description are:
Queryid (Num):  52
Total number of documents over all queries
    Retrieved:  1000
    Relevant:    345
    Rel_ret:     312
Interpolated Recall - Precision Averages:
    at 0.00    1.0000
    at 0.10    1.0000
    at 0.20    0.9780
    at 0.30    0.9766
    at 0.40    0.9603
    at 0.50    0.9067
    at 0.60    0.8620
    at 0.70    0.8620
    at 0.80    0.8620
    at 0.90    0.7422
    at 1.00    0.0000
Average precision (non-interpolated) over all rel docs:
    0.8305
Precision:
  At   5 docs: 1.0000
  At  10 docs: 1.0000
  At  15 docs: 1.0000
  At  20 docs: 1.0000
  At  30 docs: 1.0000
  At 100 docs: 0.9700