SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
thesaurus entries, and attested in the documents, were also added
to the thesaurus with a partial score.
The routing/partitioning thesaurus was passed over the parsed
representation of the 1.2-gigabyte training set, inducing a ranking of all
500,000 documents. The top 2000 documents were retained for the
next stage.
The next stage of construction of each topic's routing/partitioning
query began by calculating the IDFm the 5 relevant documents that
were ranked highest in the previous stage were added to the
original hand-weighted query terms, forming the final query.
For the second 900-megabyte data set, the routing/partitioning
thesaurus developed in the first stage of processing (as described
above) was used to select the 2000 highest-ranked documents.
The final query produced in the second stage (above) was used as
a vector-space query (with partial matching) over the 2000
documents to produce a final set of 200 ranked documents for each
topic.
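
The two-stage flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`thesaurus_score`, `two_stage_rank`), the representation of the thesaurus as a term-to-score mapping, the summing of thesaurus scores, and the plain cosine ranking over raw term counts are all hypothetical and are not taken from the system itself. Only the cutoffs of 2000 and 200 documents come from the text.

```python
import math
from collections import Counter


def thesaurus_score(doc_terms, thesaurus):
    """Sum the thesaurus scores of the distinct terms attested in a document.

    Assumed scoring rule; the text does not give the actual formula.
    """
    return sum(thesaurus.get(term, 0.0) for term in set(doc_terms))


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def two_stage_rank(docs, thesaurus, query_vec, keep=2000, final=200):
    """docs maps a document id to its list of terms (hypothetical layout)."""
    # Stage 1: rank every document by thesaurus score and keep the top `keep`.
    staged = sorted(
        docs, key=lambda d: thesaurus_score(docs[d], thesaurus), reverse=True
    )[:keep]
    # Stage 2: run the final query as a vector-space query over the survivors.
    scored = [(cosine(query_vec, Counter(docs[d])), d) for d in staged]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:final]]
```

The point of the split is that the inexpensive thesaurus pass touches the whole collection once, while the more detailed vector comparison only has to run over the retained documents.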
III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total CPU seconds between when a query enters the system until a list of
document numbers are obtained)
The final set of 2000 documents for each topic was collected by the use of the
routing/partitioning thesaurus (described above). This process was done
simultaneously for all queries and took about 6 hours for the complete corpus.
2. ranking time (total cpu seconds to sort document list)
Once the vector-space matrix for the final set of 2000 documents was constructed,
the actual comparison of the query vector to all other vectors in the matrix took on
the order of 10-20 seconds.
B. Which methods best describe your machine searching methods?
1. vector space model
Yes. Using whole and partial matching on IDF/TF-weighted terms.
C. What factors are included in your ranking?
1. term frequency
2. inverse document frequency
3. other term weights (where do they come from?)
Topic terms were given additional factors of "1", "2", or "3".
7. proximity of terms
Parts of noun phrases are close. Our partial matching of noun phrases implicitly
includes proximity.
9. document length
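
As an illustration of how the factors listed above might be combined into one document score, here is a minimal sketch. Everything in it is an assumption made for the example, since the text does not give the system's formula: the names `idf`, `phrase_match` and `score_document`, the use of log(N/df) for inverse document frequency, the partial-match credit (shared words divided by the longer phrase's word count), the multiplication by the topic factor, and the division by the number of phrases in the document as a length normalization.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency as log(N / df); assumed form."""
    return math.log(num_docs / doc_freq) if doc_freq else 0.0


def phrase_match(query_phrase, doc_phrase):
    """Return 1.0 for identical phrases, a fraction for partial overlap.

    Hypothetical rule: shared distinct words over the larger phrase size.
    """
    q_words, d_words = set(query_phrase.split()), set(doc_phrase.split())
    if not q_words or not d_words:
        return 0.0
    return len(q_words & d_words) / max(len(q_words), len(d_words))


def score_document(query, doc_phrases, doc_freqs, num_docs):
    """query maps a phrase to its topic factor (1, 2 or 3)."""
    if not doc_phrases:
        return 0.0
    total = 0.0
    for phrase, topic_factor in query.items():
        # Term frequency counts whole matches fully and partial matches partly.
        tf = sum(phrase_match(phrase, dp) for dp in doc_phrases)
        total += tf * idf(num_docs, doc_freqs.get(phrase, 0)) * topic_factor
    # Divide by document length so long documents are not favored.
    return total / len(doc_phrases)
```

Because partial matches are scored on words shared within a single noun phrase, words that match are necessarily near each other, which is one way proximity can enter the ranking implicitly.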
IV. What machine did you conduct the TREC experiment on?
How much RAM did it have?
What was the clock rate of the CPU?
Total available machines, used variously:
1 DECstation 5820 (64-Meg RAM)
2 DECstation 5000 (32-Meg RAM)