NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
13. are the manually-indexed terms used?
Yes/No. The manually-indexed terms were treated as additional text and processed
(for NPs) along with the other sections of the topic statement. They may or may
not have survived review; they were not given special treatment except as potential
sources of NPs for the topic.
14. other techniques used to build data structures (brief description)
The CLARIT system has facilities for the discovery of `first-order' thesauri (= a list
of important and characteristic terms) over collections of documents. The technique
requires that documents in the collection be from the same `domain' or `topic'
(broadly conceived) and is reliable only if the document set is large enough (e.g.,
minimally 2-3 megabytes). TREC topics--even when supplemented by sets of
relevant documents--fall far short of the minimal size required, so general CLARIT
thesaurus discovery could not be used in preparing topics or to support the indexing
of texts. However, one effect of the CLARIT thesaurus-discovery procedure is to
rank terms in a collection based on their frequency, distribution, and `rarity' scores.
In preparing sets of terms to assist in partitioning the TREC corpus (to identify a
subset of documents with the best candidates under any topic), we produced
pseudo-thesauri for each topic by using CLARIT thesaurus-discovery modules. In
particular, the process produced a list of terms from the available topic-relevant
documents (or from a small sample of relevant documents that we may have found)
and automatically chose the top (approximately 20%) ranked terms to supplement
the original query (as derived from the topic statement) to produce a
"routing/partitioning thesaurus" for the topic. (The use of this resource is described
below.)
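The ranking-and-selection step above might be sketched as follows. The actual CLARIT scoring combines frequency, distribution, and `rarity'; since the exact formula is not given here, the composite score below (and the omission of a background corpus for rarity) is an illustrative assumption, as is the 20% cutoff parameter.

```python
# Hypothetical sketch of building a pseudo-thesaurus from a small set of
# topic-relevant documents: rank candidate terms and keep the top ~20%.
from collections import Counter

def build_pseudo_thesaurus(docs, top_fraction=0.20):
    """Rank terms across topic-relevant docs and keep the top fraction.

    docs: list of documents, each a list of extracted terms (e.g., NPs).
    """
    term_freq = Counter()   # total occurrences (frequency)
    doc_freq = Counter()    # number of docs containing the term (distribution)
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))

    n_docs = len(docs)

    def score(term):
        # Illustrative composite: terms that occur often and appear in many
        # of the relevant docs rank high. A true 'rarity' component would
        # need a background corpus, so it is omitted in this sketch.
        return term_freq[term] * (doc_freq[term] / n_docs)

    ranked = sorted(term_freq, key=score, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]
```

The selected terms would then be merged with the terms derived from the topic statement to form the routing/partitioning thesaurus.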
Furthermore, in developing extended queries for our final processing step (= a
vector-space retrieval), we supplemented the original set of terms for the topic with
*all* of the terms from the small set of top-ranking documents (as determined by
routing/partitioning score) for each topic.
B. Statistics on data structures built from TREC text (please fill out each applicable section)
4. special routing structures (what?)
Yes. Each topic text was automatically analyzed by CLARIT to extract NPs. Terms
nominated by parsing were reviewed by members of the CLARIT team for
appropriateness (and retained or eliminated) and given a weight of "1", "2", or "3"
to quantify relevance. Available topic-relevant documents were processed for
supplemental terms (each given a fractional weight, e.g., "0.3"). The combined
list--terms from the topic text and terms from the topic-relevant documents--formed
a "routing/partitioning thesaurus" for the topic.
Each TREC document was `scored' against the routing/partitioning thesaurus for
each topic. In particular, every NP in each document was matched against the NPs
(terms) in the routing thesaurus; partial matches were allowed; a formula yielded
a composite score for the document based on the number of exact and partial hits
as a function of document length and term `value'.
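A minimal sketch of this scoring step is given below. The text does not specify the actual formula, so the partial-match rule (any shared word between two NPs), the 0.5 partial-hit discount, and the length normalization are assumptions for illustration only.

```python
# Hypothetical sketch of scoring one document against a weighted
# routing/partitioning thesaurus (exact and partial NP matches).
def score_document(doc_nps, thesaurus):
    """Composite score for a document's noun phrases.

    doc_nps: list of noun-phrase strings extracted from the document.
    thesaurus: dict mapping term -> weight (e.g., 1-3 for reviewed topic
               terms, fractional weights like 0.3 for terms drawn from
               topic-relevant documents).
    """
    if not doc_nps:
        return 0.0
    total = 0.0
    for np in doc_nps:
        np_words = set(np.split())
        for term, weight in thesaurus.items():
            if np == term:
                total += weight           # exact hit: full term weight
            elif np_words & set(term.split()):
                total += 0.5 * weight     # partial hit: shared word(s)
    return total / len(doc_nps)           # normalize by document length
```

Under this scheme a document dominated by exact hits on high-weight topic terms scores well above one with only scattered partial matches, which is the behavior the routing/partitioning step relies on.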
In the first round (first 50 topics) of processing, this approach was used to identify
the highest-scoring 2000 documents for each topic.
a. total amount of storage (megabytes)
0.6 megabytes for the merged 50 routing structures, i.e., the 50
"routing/partitioning thesauri" for the 50 topics.
b. total computer time to build (approximate number of hours)
5 minutes of real time--exclusive of the preparatory time to parse, build a
simple index, find some relevant documents, review them, and combine them
into an input file.
c. is the process completely automatic?
The manual review and weighting of terms from the topic statement took