NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

thesaurus entries, and attested in the documents, were also added to the thesaurus with a partial score.

The routing/partitioning thesaurus was passed over the parsed representation of the 1.2-gigabyte training set, inducing a ranking of all 500,000 documents. The top 2000 documents were retained for the next stage.

The next stage of construction of each topic's routing/partitioning query began by calculating the IDF/TF score of all the terms and their contained words in the 2000 retained documents for that topic. The IDF/TF-weighted terms from the 5 relevant documents that were ranked highest in the previous stage were added to the original hand-weighted query terms, forming the final query.

For the second 900-megabyte data set, the routing/partitioning thesaurus developed in the first stage of processing (as described above) was used to select the 2000 highest-ranked documents. The final query produced in the second stage (above) was used as a vector-space query (with partial matching) over the 2000 documents to produce a final set of 200 ranked documents for each topic.

III. Searching

A. Total computer time to search (cpu seconds)

1. retrieval time (total CPU seconds between when a query enters the system until a list of document numbers are obtained)

The final set of 2000 documents for each topic was collected by the use of the routing/partitioning thesaurus (described above). This process was done simultaneously for all queries and took about 6 hours for the complete corpus.

2. ranking time (total cpu seconds to sort document list)

Once the vector-space matrix for the final set of 2000 documents was constructed, the actual comparison of the query vector to all other vectors in the matrix took on the order of 10-20 seconds.

B. Which methods best describe your machine searching methods?

1. vector space model

Yes. Using whole and partial matching on IDF/TF-weighted terms.

C. What factors are included in your ranking?

1. term frequency
2. inverse document frequency
3. other term weights (where do they come from?)

Topic terms were given additional factors of "1", "2", or "3".

7. proximity of terms

Parts of noun phrases are close. Our partial matching of noun phrases implicitly includes proximity.

9. document length

IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?

Total available machines, used variously:

1 DECstation 5820 (64-Meg RAM)
2 DECstation 5000 (32-Meg RAM)
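
The final ranking step described under III above (a vector-space comparison with whole and partial matching on weighted terms) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the system described here: the weight is computed as term frequency times a logarithmic inverse document frequency, similarity is the cosine measure, partial matching is modelled by crediting each word contained in a multi-word term at a reduced weight (the hypothetical `PARTIAL_WEIGHT`), and the function names are invented. Only the additional topic-term factors of 1, 2, or 3 and the cut-off of 200 ranked documents come from the text.

```python
import math
from collections import Counter

PARTIAL_WEIGHT = 0.5  # hypothetical discount for a word contained in a longer term


def expand(terms):
    """Bag of features: each term, plus the words of any multi-word term."""
    bag = Counter()
    for term in terms:
        bag[term] += 1.0
        words = term.split()
        if len(words) > 1:
            for word in words:
                bag[word] += PARTIAL_WEIGHT
    return bag


def idf_table(doc_bags):
    """Assumed IDF form: log(N / document frequency) + 1."""
    n = len(doc_bags)
    df = Counter()
    for bag in doc_bags:
        df.update(bag.keys())
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def weight(bag, idf, factors=None):
    """Weight each feature by tf * idf * optional topic-term factor (1, 2, or 3)."""
    factors = factors or {}
    return {t: tf * idf.get(t, 0.0) * factors.get(t, 1) for t, tf in bag.items()}


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_terms, docs, factors=None, top_k=200):
    """Rank candidate documents (id -> list of terms) against the query."""
    doc_bags = {doc_id: expand(terms) for doc_id, terms in docs.items()}
    idf = idf_table(list(doc_bags.values()))
    query_vec = weight(expand(query_terms), idf, factors)
    scored = [(cosine(query_vec, weight(bag, idf)), doc_id)
              for doc_id, bag in doc_bags.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]
```

In this sketch `docs` stands for the candidate set already selected for a topic (2000 documents in the description above), each reduced to a list of terms. Because the cosine measure divides by vector length, document length enters the score through normalization, which is one common way to account for the "document length" factor listed under C.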