SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Recent Developments in Natural Language Text Retrieval chapter T. Strzalkowski J. Carballo National Institute of Standards and Technology D. K. Harman

(3) nyuir3: A run of manually pruned topics 101-150 against the WSJ database with the following fields used: <title>, <desc>, <con> and <fac> only. Both syntactic phrases and term similarities were included. Manual intervention involved removing some terms from queries before database search.

Summary statistics for these runs are shown in Table 2. In addition, the `base' column reports the system's performance on text fields with no language preprocessing, and with no phrase terms or similarities used. We must note, however, that in all cases the topics have been processed with our suffix-trimmer, which means some NLP has been done already (tagging + lexicon), and therefore what we see here is not the performance of a `pure' statistical system.

In the routing category only automatic runs were done (again, these are the official TREC-2 results):

(1) nyuir1: An automatic run of topics 51-100 against the SJMN database with the following fields used: <title>, <desc>, and <narr> only. Both syntactic phrases and term similarities were included.

(2) nyuir2: An automatic run of topics 51-100 against the SJMN database with the following fields used: <title>, <desc>, <con> and <fac> only. Both syntactic phrases and term similarities were included.

A (simulated) routing mode run means that queries (i.e., terms and their weights) were derived with respect to a (different) training database (here WSJ), and were subsequently run against the new database (here SJMN). In particular, this means that the terms and their relative importance (reflected primarily through idf weights) were those of the WSJ database rather than the SJMN database. Routing runs are summarized in Table 3.
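The simulated routing setup described above can be illustrated with a minimal sketch: query term weights are computed from a training collection and then used unchanged to rank documents from a different collection. Everything here is an assumption made for illustration: the function names, the toy documents, the plain log(N/df) form of idf, and the simple tf-times-weight score are hypothetical and are not the weighting scheme actually used in the runs.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to log(N / df) over a tokenized collection (assumed idf form)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def build_query(topic_terms, training_idf):
    """Weight topic terms by training-collection idf; terms unseen in training are dropped."""
    return {t: training_idf[t] for t in topic_terms if t in training_idf}

def score(query, doc):
    """Sum over query terms of (training idf weight) x (term frequency in the document)."""
    tf = Counter(doc)
    return sum(weight * tf[term] for term, weight in query.items())

# Toy stand-ins for the training collection (WSJ role) and the routing collection (SJMN role).
training_docs = [
    ["stock", "market", "falls"],
    ["stock", "prices", "rise"],
    ["court", "ruling"],
]
routing_docs = [
    ["local", "court", "ruling"],
    ["stock", "market", "report"],
]

# Weights come from the training collection only; the routing collection's own
# document frequencies are never consulted.
query = build_query(["stock", "court"], idf_table(training_docs))
ranked = sorted(range(len(routing_docs)),
                key=lambda i: score(query, routing_docs[i]),
                reverse=True)
```

In this toy data "court" is rarer than "stock" in the training collection, so it receives the higher weight, and that ordering carries over to the routing collection regardless of how frequent either term is there.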
As in Table 2, a `base' column is added to show the system's performance without the NLP module. We may note that the routing results are generally well below the ad-hoc results, both because the base system performance is inferior and because query processing has a different effect on the final statistics. The last column is a post-TREC run.15

15 It should be noted that in category B runs, three topics (63, 65, and 88) had no relevant documents in the SJMN database. Unfortunately, the evaluation program counts those as if there were relevant documents but none had been found, thus underestimating the system's performance by 5 to 8%. Excluding these three topics from consideration we obtain, in the last column, the average precision of 0.2624 and the R-precision of 0.3000.

Run            base      nyuir1    nyuir2    nyuir3
Name           ad-hoc    ad-hoc    ad-hoc    ad-hoc
Queries        50        50        50        50

Total number of docs over all queries
Ret            49387     49834     49876     49877
Rel            3929      3929      3929      3929
RelRet         2740      2983      3274      3281

Recall         (interp) Precision Averages
0.00           0.7038    0.7013    0.7528    0.7528
0.10           0.4531    0.4874    0.5567    0.5574
0.20           0.3708    0.4326    0.4721    0.4724
0.30           0.3028    0.3531    0.4060    0.4076
0.40           0.2550    0.3076    0.3617    0.3621
0.50           0.2059    0.2637    0.3135    0.3142
0.60           0.1641    0.2175    0.2703    0.2711
0.70           0.1180    0.1617    0.2231    0.2237
0.80           0.0766    0.1176    0.1667    0.1697
0.90           0.0417    0.0684    0.0915    0.0916
1.00           0.0085    0.0102    0.0154    0.0160

Average precision over all rel docs
Avg            0.2224    0.2649    0.3111    0.3118

Precision at
5 docs         0.4640    0.4920    0.5360    0.5360
10 docs        0.4140    0.4420    0.4880    0.4880
15 docs        0.3867    0.4240    0.4693    0.4707
20 docs        0.3670    0.4050    0.4390    0.4410
30 docs        0.3253    0.3640    0.4067    0.4080
100 docs       0.2304    0.2720    0.3094    0.3094
200 docs       0.1626    0.1886    0.2139    0.2140
500 docs       0.0911    0.1026    0.1137    0.1140
1000 docs      0.0548    0.0597    0.0655    0.0656

R-Precision (after RelRet)
Exact          0.2605    0.3003    0.3320    0.3321

Table 2. Automatic ad-hoc run statistics for queries 101-150 against WSJ database: (1) base - statistical terms only with <desc> and <narr> fields; (2) nyuir1 - using syntactic phrases and similarities with <desc> and <narr> fields only; (3) nyuir2 - same as 2 but with <desc>, <con>, and <fac> fields only; and (4) nyuir3 - same as 3 but queries manually pruned before search.
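The effect described in footnote 15 can be seen in a small sketch of two of the measures reported in the table, non-interpolated average precision and R-precision, averaged over topics with and without the topics that have no relevant documents. This is not the official evaluation program: the function names and the toy rankings and judgments are hypothetical, and scoring an empty topic as 0.0 is an assumption that mirrors the behaviour the footnote describes.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision over all relevant documents."""
    if not relevant:
        return 0.0  # assumption: an empty topic is scored as a complete miss
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    return total / len(relevant)

def r_precision(ranked, relevant):
    """Precision after R documents retrieved, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0  # same assumption as above
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

def mean_over_topics(per_topic, judged, skip_empty):
    """Average a per-topic measure, optionally skipping topics with no relevant documents."""
    topics = [t for t in per_topic if not (skip_empty and not judged[t])]
    return sum(per_topic[t] for t in topics) / len(topics)

# Hypothetical rankings and relevance judgments; topic "t3" has no relevant documents.
runs = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y"], "t3": ["p", "q"]}
judged = {"t1": {"a", "c"}, "t2": {"y"}, "t3": set()}

ap = {t: average_precision(runs[t], judged[t]) for t in runs}
with_empty = mean_over_topics(ap, judged, skip_empty=False)
without_empty = mean_over_topics(ap, judged, skip_empty=True)
```

Under the zero-score assumption, dropping the empty topics only changes the divisor, so the mean over the remaining topics is never lower than the mean over all topics.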