NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
Run                      base      nyuir1    nyuir2    nyuir2a
Name                     routing   routing   routing   routing
Queries                  50        50        50        50
Total number of docs over all queries
  Ret                    50000     50000     50000     50000
  Rel                    2064      2064      2064      2064
  RelRet                 1349      1390      1610      1623
Recall (interp) Precision Averages
  0.00                   0.5276    0.5400    0.6435    0.6458
  0.10                   0.3685    0.3937    0.4610    0.5021
  0.20                   0.3054    0.3423    0.3705    0.4151
  0.30                   0.2373    0.2572    0.3031    0.3185
  0.40                   0.2039    0.2263    0.2637    0.2720
  0.50                   0.1824    0.2032    0.2282    0.2379
  0.60                   0.1596    0.1674    0.1934    0.1899
  0.70                   0.1167    0.1295    0.1542    0.1571
  0.80                   0.0854    0.0905    0.1002    0.1163
  0.90                   0.0368    0.0442    0.0456    0.0434
  1.00                   0.0228    0.0284    0.0186    0.0158
Average precision over all rel docs
  Avg                    0.1884    0.2038    0.2337    0.2466
Precision at
  5 docs                 0.3160    0.3360    0.4280    0.4440
  10 docs                0.3100    0.3240    0.4000    0.4180
  15 docs                0.2813    0.2933    0.3613    0.3800
  20 docs                0.2670    0.2790    0.3260    0.3530
  30 docs                0.2240    0.2404    0.2760    0.2993
  100 docs               0.1306    0.1412    0.1708    0.1698
  200 docs               0.0865    0.0939    0.1078    0.1107
  500 docs               0.0464    0.0489    0.0575    0.0570
  1000 docs              0.0270    0.0278    0.0322    0.0325
R-Precision (after Rel)
  Exact                  0.21??    0.2267    0.2513    0.2820

Table 3. Automatic routing run statistics for queries 51-100 against
SJMN database: (1) base - statistical terms only with <desc> and
<narr> fields; (2) nyuir1 - using syntactic phrases and similarities with
<desc> and <narr> fields only; (3) nyuir2 - same as 2 but with <desc>,
<con>, and <fac> fields only; and (4) nyuir2a - run nyuir2 repeated
with new weighting for phrases.
TERM WEIGHTING ISSUES
Finding a proper term weighting scheme is critical
in term-based retrieval since the rank of a document is
determined by the weights of the terms it shares with the
query. One popular term weighting scheme, known as
tf.idf, weights terms proportionately to their inverted
document frequency scores and to their in-document fre-
quencies (tf). The in-document frequency factor is usu-
ally normalized by the document length, that is, it is
more significant for a term to occur 5 times in a short
20-word document than to occur 10 times in a 1000-
word article. [16]
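As an illustration only, the following Python sketch computes a
length-normalized tf.idf weight of the kind described above; the
function name and the log-based idf are generic placeholders, not the
exact formula used in our runs.

    import math

    def tfidf_weight(term_freq, doc_length, doc_freq, num_docs):
        # Generic length-normalized tf.idf weight (illustrative only).
        # term_freq  -- occurrences of the term in the document (tf)
        # doc_length -- total number of terms in the document
        # doc_freq   -- number of documents containing the term
        # num_docs   -- total number of documents in the collection
        tf = term_freq / doc_length          # tf normalized by document length
        idf = math.log(num_docs / doc_freq)  # inverse document frequency
        return tf * idf

    # 5 occurrences in a 20-word document outweigh 10 in a 1000-word article
    print(tfidf_weight(5, 20, 100, 100000))     # ~1.73
    print(tfidf_weight(10, 1000, 100, 100000))  # ~0.07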
In our official TREC runs we used the normalized
tf.idf weights for all terms alike: single 'ordinary-word'
terms, proper names, as well as phrasal terms consisting
of 2 or more words. Whenever phrases were included in
the term set of a document, the length of this document
was increased accordingly. This had the effect of
decreasing tf factors for 'regular' single-word terms.
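To make the effect concrete with hypothetical numbers: appending
phrasal terms to a document's term set enlarges the length used in the
tf normalization, so the normalized tf of every single-word term shrinks.

    # Hypothetical numbers: a 1000-term document gains 200 phrasal terms.
    tf_before = 5 / 1000   # normalized tf of a word occurring 5 times
    tf_after = 5 / 1200    # same word after phrases inflate the document length
    print(tf_before, tf_after)  # 0.005 vs. ~0.00417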
A standard tf.idf weighting scheme (and we
suspect any other uniform scheme based on frequencies)
is inappropriate for mixed term sets (ordinary concepts,
proper names, phrases) because:
(1) It favors terms that occur fairly frequently in a
document, which supports only general-type
queries (e.g., "all you know about 'star wars'").
Such queries are not typical in TREC.
(2) It attaches low weights to infrequent, highly
specific terms, such as names and phrases, whose
only occurrences in a document often decide its
relevance. Note that such terms cannot be reli-
ably distinguished using their distribution in the
database as the sole factor, and therefore syntac-
tic and lexical information is required.
(3) It does not address the problem of inter-term
dependencies arising when phrasal terms and
their component single-word terms are all
included in a document representation, i.e.,
launch+satellite and satellite are not indepen-
dent, and it is unclear whether they should be
counted as two terms.
In our post-TREC-2 experiments we considered
(1) and (2) only. We changed the weighting scheme so
that the phrases (but not the names, which we did not dis-
tinguish in TREC-2) were more heavily weighted by
their idf scores, while the in-document frequency scores
were replaced by logarithms multiplied by sufficiently
large constants. In addition, the top N highest-idf match-
ing terms (simple or compound) were counted more
toward the document score than the remaining terms.
This 'hot-spot' retrieval option is discussed in the next
section.
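A rough sketch of this kind of scheme appears below; the constants, the
phrase boost, and the extra credit given to the top-N highest-idf matching
terms are placeholder values chosen for illustration, not the parameters
actually used in these experiments.

    import math

    def term_weight(term_freq, doc_freq, num_docs, is_phrase,
                    C=10.0, phrase_boost=2.0):
        # Illustrative modified weight: a logarithm scaled by a large constant
        # replaces the raw tf, and phrasal terms are weighted more heavily
        # by their idf (all constants are placeholders).
        idf = math.log(num_docs / doc_freq)
        tf_part = C * math.log(1.0 + term_freq)
        if is_phrase:
            idf *= phrase_boost
        return tf_part * idf

    def document_score(matching_terms, N=5, extra=1.5):
        # matching_terms: (weight, idf) pairs for terms shared with the query.
        # The N matching terms with the highest idf count more toward the
        # document score than the remaining terms.
        ranked = sorted(matching_terms, key=lambda t: t[1], reverse=True)
        return sum(w * (extra if i < N else 1.0)
                   for i, (w, _idf) in enumerate(ranked))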
[16] This is not always true, for example when all occurrences of a
term are concentrated in a single section or a paragraph rather than
spread around the article. See the following section for more discussion.