NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
extracted from a partition of the database. For the ad-hoc queries, the partition used for the
statistics was the same as the partition actually being queried. For the routing queries, however,
the final query vector was fixed before processing the new text (i.e., the second set of TREC
documents). In particular, in this case, the partition used to weight the routing-query vector
was extracted from the training corpus (the first set of TREC documents); this vector was then
queried against a partition extracted from the new, test corpus.
The NPs and their contained words among the documents in each partition were scored
for distribution and frequency; each NP/term- and word-type was given an IDF-TF score. As
noted above, for routing queries, the IDF-TF score was based on statistics from the original
partition of 2000 documents from the training corpus; it was a static query vector. For the
ad-hoc queries, on the other hand, the final partition of 2000 documents was used as the source
of statistics for the IDF-TF scoring. Therefore, the scores for terms in the query vector for
the ad-hoc queries could vary depending on the set of documents selected in the partitioning
process. Figure 21 gives a sample of a final query.
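The paper does not spell out the exact IDF-TF formulation; a standard TF x IDF weighting computed over a 2000-document partition might look like the following sketch (function and variable names are illustrative, not the CLARIT implementation):

```python
import math
from collections import Counter

def idf_tf_scores(partition):
    """Score each term in each document of a partition by TF * IDF.

    `partition` is a list of documents, each a list of term strings
    (NPs or their contained words). Returns one dict per document
    mapping term -> IDF-TF score. (Illustrative formulation only;
    the original CLARIT weighting may differ in detail.)
    """
    n_docs = len(partition)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in partition:
        df.update(set(doc))
    scored = []
    for doc in partition:
        tf = Counter(doc)
        scored.append({
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scored
```

Terms concentrated in few documents of the partition thus receive higher weights than terms spread across many documents.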
The terms in each topic's routing/partitioning thesaurus were given IDF-TF scores based
on the sample; original-query terms were added and the factors of those terms ("1", "2", or
"3") were used to multiply their IDF-TF-based scores; the combined terms and their contained
words thus formed an extended-query vector (the final query vector).
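Merging the sampled thesaurus terms with the factor-weighted original-query terms might be sketched as follows (the merge rule and the default weight for an original term unseen in the sample are assumptions, not the paper's stated procedure):

```python
def build_extended_query(thesaurus_scores, original_terms):
    """Build the final (extended) query vector.

    `thesaurus_scores`: dict term -> IDF-TF score from the sample.
    `original_terms`: dict term -> factor (1, 2, or 3) assigned during
    manual review; the factor multiplies the term's IDF-TF-based score.
    Illustrative sketch only.
    """
    query = dict(thesaurus_scores)
    for term, factor in original_terms.items():
        # Assumed default weight of 1.0 when an original-query term
        # has no IDF-TF score from the sample.
        base = query.get(term, 1.0)
        query[term] = base * factor
    return query
```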
The 2000 documents for each topic were modeled in vector space (in which all terms and
their contained words formed the dimensions) and the final query vector was used to identify
and rank the 200 `best' documents, which constituted our results.
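This ranking step can be illustrated with a cosine-similarity sketch; the paper says only that a vector-space similarity measure was used, so cosine is an assumption here:

```python
import math

def rank_documents(query_vec, doc_vecs, top_k=200):
    """Rank documents by cosine similarity to the query vector.

    `query_vec` and each element of `doc_vecs` are dicts mapping
    term -> weight. Returns the indices of the `top_k` best
    documents, highest similarity first. (Cosine similarity is an
    assumption; any vector-space measure could be substituted.)
    """
    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    sims = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    sims.sort(key=lambda x: (-x[0], x[1]))  # best first; stable on ties
    return [i for _, i in sims[:top_k]]
```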
4.8 Summary of the Process
Figures 22 and 23 summarize the CLARIT-TREC processes described in detail in the preceding sections. As noted previously, there were only two steps in the CLARIT-TREC process
that required non-automatic processing: (1) initial review and weighting of the index terms automatically nominated and derived for the topic and (2) in the case of ad-hoc queries, review of first-pass retrieved documents to identify 5-10 relevant ones for use in creating a pseudo-thesaurus for further processing.
5 Results and Evaluation
This section presents the CLARIT-TREC results in several forms, including broad overviews
of the performance, the "official" results tables, and tables of data that focus on statistics
that are especially relevant to the CLARIT-TREC approach. Results are presented with only
abbreviated explanations.16
As noted previously, the CLARIT team submitted both intermediate results ("A") and final
results ("B"). The intermediate results were generated by taking the highest-scoring 200 (out of
2000) documents as determined by the routing/partitioning process. Since the strategy of rout-
ing/partitioning was to nominate a moderately large candidate subset of documents in which
all the true relevants would be found and since the procedure and scoring were designed to overgenerate candidates, we expected to have many `false positives' in each set of 2000. We had no
reason to expect the relative ranking of these documents by their evoking routing/partitioning
scores would be a good measure of fit to the source topic. By contrast, we expected the final
steps (which utilize subset-specific term scoring and vector-space similarity measures) to induce
a relative ranking of documents that would represent a good fit to the source topic.
16 More detailed analysis of the results is given in [Evans et al., in preparation].