SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) CLARIT TREC Design, Experiments, and Results chapter D. Evans R. Lefferts G. Grefenstette S. Handerson W. Hersh A. Archbold National Institute of Standards and Technology Donna K. Harman

Step  Input                           Process                        Output
4a    Parsed-Doc, Weighted-Terms_TOP  Feature Scoring                Scored-Doc_TOP
4b    Scored-Doc_TOP                  Ranking Top-2000               Scored-Doc(s)_TOP 50-Top/2000
4c    Scored-Doc(s)_TOP               Hand Filter = Review Top_Docs  5-10 Rel-Doc(s)_TOP
      ("Relevance-Feedback" Step in Ad-Hoc Cases)

Figure 11: Schematic Representation of Processing When `Relevant' Documents are not Given

find candidate relevant documents. In practice, this required a partitioning of a sample of data and a review of the returned top-ranked documents. This phase of processing is illustrated in Figure 11.

As shown in Figure 11, Step 4a, the weighted, relevant terms were taken as a query vector representing a subset of positive instances of concepts in the equivalence class of the topic. In the case of ad-hoc querying, the query vector was used to identify a sample of 50 candidate documents from a subset of the corpus, which were reviewed in rank order by team members until 5-10 `true' relevant documents were identified (Step 4c). This can be regarded as a `relevance-feedback' step in the querying process. In the case of routing, the sample of `true' relevants provided by the TREC organizers was accepted as valid and no review was performed.

4.5 Using Relevant Documents to Create `Partitioning Thesauri'

As indicated in Figure 11 Step 4d, the `authoritative' set of relevant documents was processed with CLARIT `thesaurus-discovery' modules to produce a set of terms that (arguably) bear some relation to the topic. We refer to the output of this process as a "pseudo-thesaurus".
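The ad-hoc sampling procedure of Steps 4a-4c (score, rank, hand-review) can be pictured with a minimal Python sketch. It is an illustration only and not the CLARIT implementation: the scoring rule (summing the weights of matched terms) is an assumption, since the feature-scoring formula is not given in this passage, and the function names `score_document`, `rank_sample`, and `review_in_rank_order`, as well as the `is_relevant` callback standing in for the human reviewer, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def score_document(doc_terms: List[str], weighted_terms: Dict[str, float]) -> float:
    """Step 4a (assumed rule): sum the weights of the query terms found in the document."""
    return sum(weighted_terms.get(term, 0.0) for term in set(doc_terms))


def rank_sample(
    docs: Dict[str, List[str]],
    weighted_terms: Dict[str, float],
    top_n: int = 50,
) -> List[Tuple[str, float]]:
    """Step 4b: score every document in the sample and keep the top_n by score."""
    scored = [(doc_id, score_document(terms, weighted_terms)) for doc_id, terms in docs.items()]
    # Highest score first; ties broken by document id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


def review_in_rank_order(
    ranked: List[Tuple[str, float]],
    is_relevant: Callable[[str], bool],
    max_relevant: int = 10,
) -> List[str]:
    """Step 4c: walk the ranking and stop once max_relevant documents are accepted."""
    relevant: List[str] = []
    for doc_id, _score in ranked:
        if is_relevant(doc_id):  # stands in for a team member's judgment
            relevant.append(doc_id)
            if len(relevant) >= max_relevant:
                break
    return relevant


if __name__ == "__main__":
    # Toy data, invented for illustration; only "bell system breakup" appears in the paper.
    weights = {"bell system breakup": 2.0, "telephone company": 1.0}
    sample = {
        "d1": ["bell system breakup", "telephone company"],
        "d2": ["weather report"],
        "d3": ["telephone company"],
    }
    top = rank_sample(sample, weights, top_n=2)
    print(review_in_rank_order(top, lambda doc_id: doc_id in {"d1", "d3"}))
```

In the routing case no such review loop would be needed, since the relevant documents supplied by the TREC organizers were accepted as given.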
The actual routing/partitioning thesaurus was generated by CLARIT by combining the set of weighted terms for the topic with the pseudo-thesaurus, as shown in Step 5. Note that partial noun phrases derived from pseudo-thesaurus entries and attested in the documents were also added to the routing/partitioning thesaurus with a partial score.

As illustrated in Figure 13 (and Figure 14), the partitioning thesaurus itself is a list of terms, where each term has an associated vector of information specifying its importance in any number of topics. In the case illustrated for Topic 57, for example, the term "bell system breakup" has the triple "<057 1 2.0>" associated with it. The "057" indicates that the term is relevant to Topic 57; the "1" indicates that the term is a full term (not an attested sub-phrase of a term); and the "2.0" gives the term's relative weight or importance (in this case, reflecting the score that was assigned by hand).

4.6 `Feature Scoring' to Partition Documents

Figure 14 gives a portion of the composite or `super thesaurus' for all 100 topics. Each