SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
completion
(4) other (describe)
Initially derived from on-line sources but substantially modified and
maintained manually
2. externally-built auxiliary file
a. type of file (Treebank, WordNet, etc.) None
b. total amount of storage (megabytes)
c. total number of concepts represented
d. type of representation (frames, semantic nets, rules, etc.)
II. Query Construction
(please fill out a section for each query construction method used)
B. Manually constructed queries (ad hoc)
Note, as described below, there were only two steps in the CLARIT process that required
non-automatic processing: (1) initial review and weighting of the index terms
automatically-nominated and derived for the topic and (2) review of 1st-pass retrieved
documents to identify 5-10 relevant ones for "feedback".
1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <del>
2. average time to build query (minutes)
5 minutes--average time to review & weight automatically-nominated terms
3. type of query builder Graduate students
4. tools used to build query
c. other lexical tools (identify)
CLARIT noun-phrase parsing (extraction) nominated query terms from the
textual descriptions of topics.
5. which of the following were used?
a. term weighting
Yes. Graduate students weighted terms with weights of "3", "2", or "1",
according to whether the extracted term was central or peripheral to the
topic. (Some extracted noun phrases were discarded as irrelevant or
ill-formed; the vast majority were retained.)
c. proximity operators
No. Though proximity plays an implicit role when noun phrases are used as
terms.
d. addition of terms not included in topic
(1) source of terms Not in the first round of routing
e. other (describe)
The ad hoc queries for the second fifty topics were formed in three stages.
The first stage was the construction of a topic-derived routing/partitioning
thesaurus.
The routing/partitioning thesaurus was generated by CLARIT from the
method described above, using only text fields of the topics. The
automatically derived noun phrases were hand-weighted by graduate
students with weights of "3", "2", or "1", according to whether the
extracted term was central or peripheral to the topic. Some extraneous
terms were deleted.
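
For illustration only, the hand-weighting step described above might be
sketched as follows. This is not the CLARIT implementation: the names
ThesaurusEntry and build_thesaurus, the use of None to mark a deleted term,
and the sample phrases are all hypothetical. Only the weight values 3, 2,
and 1 and the deletion of extraneous terms come from the description.

```python
from dataclasses import dataclass

# The only weights named in the text; their exact semantics are an assumption
# (3 = most central to the topic, 1 = most peripheral).
VALID_WEIGHTS = {1, 2, 3}


@dataclass(frozen=True)
class ThesaurusEntry:
    """One retained noun phrase with its hand-assigned weight (hypothetical type)."""
    phrase: str
    weight: int


def build_thesaurus(nominated, judgments):
    """Combine automatically nominated phrases with reviewer judgments.

    nominated: list of noun phrases extracted from the topic text.
    judgments: dict mapping phrase -> weight (1, 2, or 3), or None if the
               reviewer deleted the phrase. Phrases missing from the dict
               are treated as deleted (an assumption of this sketch).
    """
    entries = []
    for phrase in nominated:
        weight = judgments.get(phrase)
        if weight is None:
            continue  # extraneous or ill-formed term, dropped by the reviewer
        if weight not in VALID_WEIGHTS:
            raise ValueError(f"weight for {phrase!r} must be 1, 2, or 3")
        entries.append(ThesaurusEntry(phrase, weight))
    return entries


# Hypothetical sample data, not taken from any TREC topic.
sample_nominated = ["oil spill", "tanker accident", "last year"]
sample_judgments = {"oil spill": 3, "tanker accident": 2, "last year": None}
sample_thesaurus = build_thesaurus(sample_nominated, sample_judgments)
```

In this sketch the retained entries keep the order in which the phrases
were nominated, and a deleted phrase simply does not appear in the result.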
The routing/partitioning thesaurus was passed over the parsed
representation of the original 1.2 gigabyte training set, inducing a ranking of all
documents using a scoring method taking account of exact and
partial matches and document length. The top 50 documents were retained,
for the next stage. These documents were manually judged by graduate