NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
13. are the manually-indexed terms used?
Yes/No. The manually-indexed terms were treated as additional text and processed
(for NPs) along with the other sections of the topic statement. They may or may
not have survived review; they were not given special treatment except as potential
sources of NPs for the topic.
14. other techniques used to build data structures (brief description)
The CLARIT system has facilities for the discovery of `first-order' thesauri (= a list
of important and characteristic terms) over collections of documents. The technique
requires that documents in the collection be from the same `domain' or `topic'
(broadly conceived) and is reliable only if the document set is large enough (e.g.,
minimally 2-3 megabytes). TREC topics--even when supplemented by sets of
relevant documents--fall far short of the minimal size required, so general CLARIT
thesaurus discovery could not be used in preparing topics or to support the indexing
of texts. However, one effect of the CLARIT thesaurus-discovery procedure is to
rank terms in a collection based on their frequency, distribution, and `rarity' scores.
In preparing sets of terms to assist in partitioning the TREC corpus (to identify a
subset of documents with the best candidates under any topic), we produced
pseudo-thesauri for each topic by using CLARIT thesaurus-discovery modules. In
particular, the process produced a list of terms from the available topic-relevant
documents (or from a small sample of relevant documents that we may have found)
and automatically chose the top (approximately 20%) ranked terms to supplement
the original query (as derived from the topic statement) to produce a
"routing/partitioning thesaurus" for the topic. (The use of this resource is described
below.)
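The ranking-and-selection step above might be sketched as follows. The actual CLARIT scoring combines frequency, distribution, and `rarity'; since the exact formula is not given here, the composite score below (and the omission of a background corpus for rarity) is an illustrative assumption, as is the 20% cutoff parameter.

```python
# Hypothetical sketch of building a pseudo-thesaurus from a small set of
# topic-relevant documents: rank candidate terms and keep the top ~20%.
from collections import Counter

def build_pseudo_thesaurus(docs, top_fraction=0.20):
    """Rank terms across topic-relevant docs and keep the top fraction.

    docs: list of documents, each a list of extracted terms (e.g., NPs).
    """
    term_freq = Counter()   # total occurrences (frequency)
    doc_freq = Counter()    # number of docs containing the term (distribution)
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))

    n_docs = len(docs)

    def score(term):
        # Illustrative composite: terms that occur often and appear in many
        # of the relevant docs rank high. A true 'rarity' component would
        # need a background corpus, so it is omitted in this sketch.
        return term_freq[term] * (doc_freq[term] / n_docs)

    ranked = sorted(term_freq, key=score, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]
```

The selected terms would then be merged with the terms derived from the topic statement to form the routing/partitioning thesaurus.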
Furthermore, in developing extended queries for our final processing step (= a
vector-space retrieval), we supplemented the original set of terms for the topic with
*all* of the terms from the small set of top-ranking documents (as determined by
routing/partitioning score) for each topic.
B. Statistics on data structures built from TREC text (please fill out each applicable section)
4. special routing structures (what?)
Yes. Each topic text was automatically analyzed by CLARIT to extract NPs. Terms
nominated by parsing were reviewed by members of the CLARIT team for
appropriateness (and retained or eliminated) and given a weight of "1", "2", or "3"
to quantify relevance. Available topic-relevant documents were processed for
supplemental terms (each given a fractional weight, e.g., "0.3"). The combined
list--terms from the topic text and terms from the topic-relevant documents--formed
a "routing/partitioning thesaurus" for the topic.
Each TREC document was `scored' against the routing/partitioning thesaurus for
each topic. In particular, every NP in each document was matched against the NPs
(terms) in the routing thesaurus; partial matches were allowed; a formula yielded
a composite score for the document based on the number of exact and partial hits
as a function of document length and term `value'.
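A minimal sketch of this scoring step is given below. The text does not specify the actual formula, so the partial-match rule (any shared word between two NPs), the 0.5 partial-hit discount, and the length normalization are assumptions for illustration only.

```python
# Hypothetical sketch of scoring one document against a weighted
# routing/partitioning thesaurus (exact and partial NP matches).
def score_document(doc_nps, thesaurus):
    """Composite score for a document's noun phrases.

    doc_nps: list of noun-phrase strings extracted from the document.
    thesaurus: dict mapping term -> weight (e.g., 1-3 for reviewed topic
               terms, fractional weights like 0.3 for terms drawn from
               topic-relevant documents).
    """
    if not doc_nps:
        return 0.0
    total = 0.0
    for np in doc_nps:
        np_words = set(np.split())
        for term, weight in thesaurus.items():
            if np == term:
                total += weight           # exact hit: full term weight
            elif np_words & set(term.split()):
                total += 0.5 * weight     # partial hit: shared word(s)
    return total / len(doc_nps)           # normalize by document length
```

Under this scheme a document dominated by exact hits on high-weight topic terms scores well above one with only scattered partial matches, which is the behavior the routing/partitioning step relies on.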
In the first round (first 50 topics) of processing, this approach was used to identify
the highest-scoring 2000 documents for each topic.
a. total amount of storage (megabytes)
0.6 megabytes for the merged 50 routing structures, i.e., the 50
"routing/partitioning thesauri" for the 50 topics.
b. total computer time to build (approximate number of hours)
5 minutes of real time--exclusive of the preparatory time to parse, build a
simple index, find some relevant documents, review them, and combine them
into an input file.
c. is the process completely automatic?
The manual review and weighting of terms from the topic statement took