Run                        base      nyuir1    nyuir2    nyuir2a
Name                       routing   routing   routing   routing
Queries                    50        50        50        50

Total number of docs over all queries
  Retrieved                50000     50000     50000     50000
  Relevant                 2064      2064      2064      2064
  Rel & retrieved          1349      1390      1610      1623

Recall (interpolated) - Precision Averages
  0.00                     0.5276    0.5400    0.6435    0.6458
  0.10                     0.3685    0.3937    0.4610    0.5021
  0.20                     0.3054    0.3423    0.3705    0.4151
  0.30                     0.2373    0.2572    0.3031    0.3185
  0.40                     0.2039    0.2263    0.2637    0.2720
  0.50                     0.1824    0.2032    0.2282    0.2379
  0.60                     0.1596    0.1674    0.1934    0.1899
  0.70                     0.1167    0.1295    0.1542    0.1571
  0.80                     0.0854    0.0905    0.1002    0.1163
  0.90                     0.0368    0.0442    0.0456    0.0434
  1.00                     0.0228    0.0284    0.0186    0.0158

Average precision over all relevant docs
  Avg                      0.1884    0.2038    0.2337    0.2466

Precision at
  5 docs                   0.3160    0.3360    0.4280    0.4440
  10 docs                  0.3100    0.3240    0.4000    0.4180
  15 docs                  0.2813    0.2933    0.3613    0.3800
  20 docs                  0.2670    0.2790    0.3260    0.3530
  30 docs                  0.2240    0.2404    0.2760    0.2993
  100 docs                 0.1306    0.1412    0.1708    0.1698
  200 docs                 0.0865    0.0939    0.1078    0.1107
  500 docs                 0.0464    0.0489    0.0575    0.0570
  1000 docs                0.0270    0.0278    0.0322    0.0325

R-Precision (after Rel)
  Exact                    0.21??    0.2267    0.2513    0.2820

Table 3. Automatic routing run statistics for queries 51-100 against the SJMN
database: (1) base - statistical terms only, with <desc> and <narr> fields;
(2) nyuir1 - using syntactic phrases and similarities, with <desc> and <narr>
fields only; (3) nyuir2 - same as (2) but with <desc>, <con>, and <fac> fields
only; and (4) nyuir2a - run nyuir2 repeated with new weighting for phrases.

TERM WEIGHTING ISSUES

Finding a proper term weighting scheme is critical in term-based retrieval,
since the rank of a document is determined by the weights of the terms it
shares with the query. One popular term weighting scheme, known as tf.idf,
weights terms proportionately to their inverted document frequency scores and
to their in-document frequencies (tf). The in-document frequency factor is
usually normalized by the document length, that is, it is more significant for
a term to occur 5 times in a short 20-word document than to occur 10 times in
a 1000-word article.[16]

In our official TREC runs we used the normalized tf.idf weights for all terms
alike: single 'ordinary-word' terms, proper names, as well as phrasal terms
consisting of 2 or more words. Whenever phrases were included in the term set
of a document, the length of this document was increased accordingly. This had
the effect of decreasing the tf factors for 'regular' single-word terms.

A standard tf.idf weighting scheme (and we suspect any other uniform scheme
based on frequencies) is inappropriate for mixed term sets (ordinary concepts,
proper names, phrases) because:

(1) It favors terms that occur fairly frequently in a document, which supports
    only general-type queries (e.g., "all you know about 'star wars'"). Such
    queries are not typical in TREC.

(2) It attaches low weights to infrequent, highly specific terms, such as
    names and phrases, whose only occurrences in a document often decide
    relevance. Note that such terms cannot be reliably distinguished using
    their distribution in the database as the sole factor, and therefore
    syntactic and lexical information is required.
(3) It does not address the problem of inter-term dependencies arising when
    phrasal terms and their component single-word terms are all included in a
    document representation, i.e., launch+satellite and satellite are not
    independent, and it is unclear whether they should be counted as two
    terms.

In our post-TREC-2 experiments we considered (1) and (2) only. We changed the
weighting scheme so that the phrases (but not the names, which we did not
distinguish in TREC-2) were more heavily weighted by their idf scores, while
the in-document frequency scores were replaced by logarithms multiplied by
sufficiently large constants. In addition, the top N highest-idf matching
terms (simple or compound) were counted more toward the document score than
the remaining terms. This 'hot-spot' retrieval option is discussed in the next
section.

[16] This is not always true, for example when all occurrences of a term are
concentrated in a single section or a paragraph rather than spread around the
article. See the following section for more discussion.
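To make the standard scheme above concrete, the following is a minimal sketch
(in Python) of a length-normalized tf.idf weight. The log-based idf and the
division by document length are common textbook choices used here only for
illustration; they are not necessarily the exact formula behind the official
runs.

    import math

    def idf(num_docs: int, doc_freq: int) -> float:
        # Inverted document frequency: the rarer the term, the higher the score.
        return math.log(num_docs / doc_freq)

    def tfidf_weight(tf: int, doc_len: int, num_docs: int, doc_freq: int) -> float:
        # Normalized tf.idf: the in-document frequency is divided by the
        # document length, so 5 occurrences in a 20-word document count for
        # more than 10 occurrences in a 1000-word article.
        return (tf / doc_len) * idf(num_docs, doc_freq)

    # Example mirroring the text (the corpus statistics are made up):
    short_doc = tfidf_weight(tf=5, doc_len=20, num_docs=100000, doc_freq=500)
    long_doc = tfidf_weight(tf=10, doc_len=1000, num_docs=100000, doc_freq=500)
    assert short_doc > long_doc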
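The modified post-TREC-2 scheme is described only qualitatively, so the sketch
below is an illustration under stated assumptions: phrasal terms get extra idf
emphasis, raw tf is replaced by a scaled logarithm, and only the top N
highest-idf matching terms contribute fully to the document score while the
remaining matches are discounted. The constants PHRASE_BOOST, C, TOP_N, and
DISCOUNT are placeholders; the paper does not report their values.

    import math
    from dataclasses import dataclass

    C = 10.0            # "sufficiently large" constant on the log-tf factor (placeholder)
    PHRASE_BOOST = 2.0  # extra idf emphasis for phrasal terms (placeholder)
    TOP_N = 5           # number of highest-idf matches counted fully (placeholder)
    DISCOUNT = 0.5      # weight given to the remaining matches (placeholder)

    @dataclass
    class Match:
        term: str
        tf: int          # in-document frequency
        idf: float       # inverted document frequency score
        is_phrase: bool  # phrasal (multi-word) term?

    def term_weight(m: Match) -> float:
        # Log-scaled tf times idf, with phrases weighted more heavily by idf.
        idf = m.idf * PHRASE_BOOST if m.is_phrase else m.idf
        return C * math.log(1.0 + m.tf) * idf

    def document_score(matches: list) -> float:
        # 'Hot-spot'-style scoring: the TOP_N highest-idf matching terms count
        # fully toward the document score; the rest are discounted.
        ranked = sorted(matches, key=lambda m: m.idf, reverse=True)
        full = sum(term_weight(m) for m in ranked[:TOP_N])
        rest = DISCOUNT * sum(term_weight(m) for m in ranked[TOP_N:])
        return full + rest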