SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
students, starting from the highest scored downward until 5-10 relevant
documents were found. In effect, this represented a "relevance-feedback"
step in the retrieval process.
In the next stage, the 5-10 "relevant" documents were used to produce a
CLARIT-derived pseudo-thesaurus for the topic. (As described above, this
consists of a list of prominent terms in the collection of documents, based
on frequency, distribution, and "rarity" scores.) To this thesaurus were
added the terms retained from the hand-weighting of the original topics.
This thesaurus formed the second routing/partitioning thesaurus. The entire
2-gigabyte TREC collection was rescored against this second
routing/partitioning thesaurus and the highest ranking 2000 documents were
selected for the final-query stage.
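The second stage above amounts to scoring every document against a weighted term list and keeping the top candidates. The appendix does not give CLARIT's actual scoring formula, so the following is only a minimal sketch under the assumption that a document's score is the sum of the weights of thesaurus terms it contains; the term weights and function names are illustrative, not from the TREC-1 system.

```python
# Hypothetical sketch of thesaurus-based rescoring: each document is
# scored against a routing/partitioning thesaurus (a dict of term ->
# weight), and the highest-scoring documents advance to the next stage.
def thesaurus_score(doc_tokens, thesaurus):
    """Sum the weights of thesaurus terms that appear in the document."""
    present = set(doc_tokens)
    return sum(w for term, w in thesaurus.items() if term in present)

def select_top(docs, thesaurus, n=2000):
    """Rank documents (token lists) by thesaurus score; keep the top n."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: thesaurus_score(docs[i], thesaurus),
                    reverse=True)
    return ranked[:n]
```

In the run described here, n would be 2000 and the thesaurus would contain the CLARIT noun phrases plus the hand-weighted topic terms.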
The third, or final-query, stage involved, first, calculating an IDF/TF score
for each term and all term-contained words in the 2000-document set for
the topic. The query for that topic was created by taking the IDF/TF
weightings of the terms from the originally chosen 5-10 relevant documents
and automatically forming a query by combining all these terms along with
the topic-derived terms into a long query vector. A vector-space
representation of the 2000 documents was generated; the query vector was
used to identify the final set of 200 ranked documents for each topic based
on cosine similarity measures.
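The final-query stage combines standard TF-IDF weighting with cosine-similarity ranking. As a hedged sketch only (the exact IDF/TF formula used by CLARIT is not stated in this appendix, so a common log-IDF variant is assumed), the mechanics can be illustrated as:

```python
# Illustrative sketch: TF-IDF vectors for a document set, and cosine-
# similarity ranking of those documents against a long query vector.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency
    idf = {t: math.log(n / df[t]) for t in df}  # assumed log-IDF form
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs, k=200):
    """Return the indices of the k documents most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]
```

Here k would be 200, matching the final ranked set per topic described above.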
D. Automatically built queries (routing)
1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <def>
2. total computer time to build query (cpu seconds) 0.03 cpu seconds
3. which of the following were used in building the query?
a. terms selected from
(1) topic
(3) only documents with relevance judgments
b. term weighting
(1) with weights based on terms in topics
Yes. Topic terms were initially hand weighted.
c. phrase extraction
(1) from topics
(3) from documents with relevance judgments
d. syntactic parsing
(1) of topics
(2) of all training documents
(3) of documents with relevance judgments
g. tokenizer (recognizes dates, phone numbers, common patterns)
(1) which patterns are tokenized?
Only simple acronyms such as "I.B.M." were automatically
recognized as a unit.
k. other (brief description)
The routing queries were formed in two stages.
The first stage was the construction of a routing/partitioning
thesaurus.
The routing/partitioning thesaurus was generated by CLARIT from
the supplied list of relevant documents per topic. The text of the
topic fields was parsed and added to the pseudo-thesaurus derived
from the relevant documents. (Each pseudo-thesaurus consists of
automatically chosen noun phrases scoring above a certain
threshold, when scored for rarity, distribution, and frequency in the
relevant document set.) Partial noun phrases, derived from