SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
chapter
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
Step:  Input:                            Process:                          Output:
4a     Parsed-Doc, Weighted-Terms_Top    Feature Scoring                   Scored-Doc_Top
4b     Scored-Doc_Top                    Ranking Top-2000                  Scored-Doc(s)_Top 50-Top/2000
4c     Scored-Doc(s)_Top                 Hand Filter = Review Top Doc(s)   5-10 Rel-Doc(s)_Top
       ("Relevance-Feedback" Step in Ad-Hoc Cases)

Figure 11: Schematic Representation of Processing When `Relevant' Documents are not Given
find candidate relevant documents. In practice, this required a partitioning of a sample of data
and a review of the returned top-ranked documents. This phase of processing is illustrated in
Figure 11.
As shown in Figure 11, Step 4a, the weighted, relevant terms were taken as a query vector
representing a subset of positive instances of concepts in the equivalence class of the topic. In
the case of ad-hoc querying, the query vector was used to identify a sample of 50 candidate
documents from a subset of the corpus, which were reviewed in rank order by team members
until 5-10 `true' relevant documents were identified (Step 4c). This can be regarded as a
`relevance-feedback' step in the querying process. In the case of routing, the sample of `true'
relevant documents provided by the TREC organizers was accepted as valid, and no review was performed.
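The ad-hoc review loop described above can be sketched as follows. This is a minimal illustration, not the CLARIT implementation: the scoring rule (a sum of the weights of query terms found in a document), the function names `score` and `review_sample`, and the toy documents are all assumptions made for the example. Only the sample size of 50 and the stopping point of 5-10 relevant documents come from the text.

```python
from typing import Callable, Dict, List


def score(doc_terms: List[str], query: Dict[str, float]) -> float:
    """Assumed scoring rule: sum the weights of query terms present in the document."""
    present = set(doc_terms)
    return sum(weight for term, weight in query.items() if term in present)


def review_sample(
    docs: Dict[str, List[str]],
    query: Dict[str, float],
    is_relevant: Callable[[str], bool],
    sample_size: int = 50,
    target: int = 10,
) -> List[str]:
    """Rank documents by score, then review the top `sample_size` in rank
    order, stopping once `target` relevant documents have been found."""
    ranked = sorted(docs, key=lambda doc_id: score(docs[doc_id], query), reverse=True)
    found: List[str] = []
    for doc_id in ranked[:sample_size]:
        if is_relevant(doc_id):  # stands in for the human reviewer's judgment
            found.append(doc_id)
            if len(found) >= target:
                break
    return found


# Toy data (hypothetical): document id -> terms extracted from the document.
docs = {
    "d1": ["bell", "system"],
    "d2": ["antitrust"],
    "d3": ["weather"],
    "d4": ["bell", "antitrust", "system"],
}
query = {"bell": 1.0, "system": 0.5, "antitrust": 2.0}
judged_relevant = {"d4", "d2"}

sample = review_sample(docs, query, lambda d: d in judged_relevant, target=2)
print(sample)  # ['d4', 'd2']
```

The `is_relevant` callback marks the point at which a team member's judgment enters the loop. In the routing case it would not be called at all, since the supplied relevant documents are accepted as given.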
4.5 Using Relevant Documents to Create `Partitioning Thesauri'
As indicated in Figure 11 Step 4d, the `authoritative' set of relevant documents was processed
with CLARIT `thesaurus-discovery' modules to produce a set of terms that (arguably) bear
some relation to the topic. We refer to the output of this process as a "pseudo-thesaurus".
The actual routing/partitioning thesaurus was generated by CLARIT by combining the set of
weighted terms for the topic with the pseudo-thesaurus, as shown in Step 5. Note that partial
noun phrases, derived from pseudo-thesaurus entries, and attested in the documents, were also
added to the routing/partitioning thesaurus with a partial score.
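One way to picture the combination step is the sketch below. It rests on several assumptions that the text does not settle: that `partial noun phrases' are contiguous word sub-sequences of pseudo-thesaurus entries, that the partial score is a fixed fraction (`partial_factor`, hypothetical) of the parent entry's weight, and that a topic term takes precedence over a pseudo-thesaurus entry with the same string. The function name and the sample terms other than "bell system breakup" are likewise illustrative.

```python
from typing import Dict, Set, Tuple

# Each entry maps a term to (is_full_term, weight).
Entry = Tuple[bool, float]


def build_partitioning_entries(
    topic_terms: Dict[str, float],
    pseudo_thesaurus: Dict[str, float],
    attested_phrases: Set[str],
    partial_factor: float = 0.5,  # assumed; the paper gives no value here
) -> Dict[str, Entry]:
    entries: Dict[str, Entry] = {}

    # Weighted topic terms go in first and take precedence.
    for term, weight in topic_terms.items():
        entries[term] = (True, weight)

    # Pseudo-thesaurus terms are added as full terms unless already present.
    for term, weight in pseudo_thesaurus.items():
        entries.setdefault(term, (True, weight))

    # Proper sub-phrases of pseudo-thesaurus entries that are attested in
    # the documents are added with a reduced (partial) score.
    for term, weight in pseudo_thesaurus.items():
        words = term.split()
        for length in range(1, len(words)):
            for start in range(len(words) - length + 1):
                sub = " ".join(words[start:start + length])
                if sub in attested_phrases and sub not in entries:
                    entries[sub] = (False, partial_factor * weight)
    return entries


entries = build_partitioning_entries(
    topic_terms={"bell system breakup": 2.0},
    pseudo_thesaurus={"long distance telephone service": 1.0},
    attested_phrases={"long distance", "telephone service", "service"},
)
for term, (is_full, weight) in entries.items():
    print(term, is_full, weight)
```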
As illustrated in Figure 13 (and Figure 14), the partitioning thesaurus itself is a list of
terms, where each term has an associated vector of information specifying its importance in
any number of topics. In the case illustrated for Topic 57, for example, the term "bell system
breakup" has the triple "<057 1 2.0>" associated with it. The "057" indicates that the term is
relevant to Topic 57; the "1" indicates that the term is a full term (not an attested sub-phrase
of a term); and the "2.0" gives the term's relative weight or importance (in this case, reflecting
the score that was assigned by hand).
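The entry structure just described can be modeled as a mapping from each term to a list of (topic, full-term flag, weight) triples. The sketch below parses the "<057 1 2.0>" notation into such a record. The names `TopicEntry` and `parse_triples` are hypothetical, and two details are assumptions: that several triples for one term would appear side by side in the same notation, and that a sub-phrase is flagged with "0" (the text only states that "1" marks a full term).

```python
import re
from typing import Dict, List, NamedTuple


class TopicEntry(NamedTuple):
    topic: str          # topic number, kept as a string to preserve "057"
    is_full_term: bool  # True when the flag is "1"
    weight: float       # relative weight or importance of the term


# Matches triples such as "<057 1 2.0>".
TRIPLE = re.compile(r"<(\d+)\s+(\d)\s+(\d+(?:\.\d+)?)>")


def parse_triples(text: str) -> List[TopicEntry]:
    """Parse every triple found in `text` into a TopicEntry."""
    return [
        TopicEntry(topic, flag == "1", float(weight))
        for topic, flag, weight in TRIPLE.findall(text)
    ]


# The example from the text: "bell system breakup" in Topic 57.
thesaurus: Dict[str, List[TopicEntry]] = {
    "bell system breakup": parse_triples("<057 1 2.0>"),
}

entry = thesaurus["bell system breakup"][0]
print(entry.topic, entry.is_full_term, entry.weight)  # 057 True 2.0
```

Keeping a list of triples per term reflects the statement that a single term may carry importance information for any number of topics.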
4.6 `Feature Scoring' to Partition Documents
Figure 14 gives a portion of the composite or `super thesaurus' for all 100 topics. Each