NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

completion
      (4) other (describe)
          Initially derived from on-line sources but substantially modified and maintained manually
   2. externally-built auxiliary file
      a. type of file (Treebank, WordNet, etc.)
         None
      b. total amount of storage (megabytes)
      c. total number of concepts represented
      d. type of representation (frames, semantic nets, rules, etc.)

II. Query Construction (please fill out a section for each query construction method used)

B. Manually constructed queries (ad hoc)

   Note: as described below, there were only two steps in the CLARIT process that required non-automatic processing: (1) initial review and weighting of the index terms automatically nominated and derived for the topic and (2) review of 1st-pass retrieved documents to identify 5-10 relevant ones for "feedback".

   1. topic fields used
      <title>, <desc>, <narr>, <con>, <fac>, <def>
   2. average time to build query (minutes)
      5 minutes (average time to review and weight automatically nominated terms)
   3. type of query builder
      Graduate students
   4. tools used to build query
      c. other lexical tools (identify)
         CLARIT noun-phrase parsing (extraction) nominated query terms from the textual descriptions of topics.
   5. which of the following were used?
      a. term weighting
         Yes. Graduate students weighted terms with weights of "3", "2", or "1", according to whether the extracted term was central or peripheral to the topic. (Some extracted noun phrases were discarded as irrelevant or ill-formed; the vast majority were retained.)
      c. proximity operators
         No, though proximity plays an implicit role when noun phrases are used as terms.
      d. addition of terms not included in topic
         (1) source of terms
             Not in the first round of routing
      e. other (describe)
         The ad hoc queries for the second fifty topics were formed in three stages. The first stage was the construction of a topic-derived routing/partitioning thesaurus. The routing/partitioning thesaurus was generated by CLARIT from the method described above, using only text fields of the topics. The automatically derived noun phrases were hand-weighted by graduate students with weights of "3", "2", or "1", according to whether the extracted term was central or peripheral to the topic. Some extraneous terms were deleted. The routing/partitioning thesaurus was passed over the parsed representation of the original 1.2 gigabyte training set, inducing a ranking of all documents using a scoring method taking account of exact and partial matches and document length. The top 50 documents were retained for the next stage. These documents were manually judged by graduate