SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
students, starting from the highest scored downward until 5-10 relevant
documents were found. In effect, this represented a "relevance-feedback"
step in the retrieval process.
In the next stage, the 5-10 "relevant" documents were used to produce a
CLARIT-derived pseudo-thesaurus for the topic. (As described above, this
consists of a list of prominent terms in the collection of documents, based
on frequency, distribution, and "rarity" scores.) To this thesaurus were
added the terms retained from the hand-weighting of the original topics.
This thesaurus formed the second routing/partitioning thesaurus. The entire
2-gigabyte TREC collection was rescored against this second
routing/partitioning thesaurus and the highest ranking 2000 documents were
selected for the final-query stage.
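The second stage above amounts to scoring every document against a weighted term list and keeping the top candidates. The appendix does not give CLARIT's actual scoring formula, so the following is only a minimal sketch under the assumption that a document's score is the sum of the weights of thesaurus terms it contains; the term weights and function names are illustrative, not from the TREC-1 system.

```python
# Hypothetical sketch of thesaurus-based rescoring: each document is
# scored against a routing/partitioning thesaurus (a dict of term ->
# weight), and the highest-scoring documents advance to the next stage.
def thesaurus_score(doc_tokens, thesaurus):
    """Sum the weights of thesaurus terms that appear in the document."""
    present = set(doc_tokens)
    return sum(w for term, w in thesaurus.items() if term in present)

def select_top(docs, thesaurus, n=2000):
    """Rank documents (token lists) by thesaurus score; keep the top n."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: thesaurus_score(docs[i], thesaurus),
                    reverse=True)
    return ranked[:n]
```

In the run described here, n would be 2000 and the thesaurus would contain the CLARIT noun phrases plus the hand-weighted topic terms.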
The third, or final-query, stage involved, first, calculating an IDF/TF score
for each term and all term-contained words in the 2000-document set for
the topic. The query for that topic was created by taking the IDF/TF
weightings of the terms from the originally chosen 5-10 relevant documents
and automatically forming a query by combining all these terms along with
the topic-derived terms into a long query vector. A vector-space
representation of the 2000 documents was generated; the query vector was
used to identify the final set of 200 ranked documents for each topic based
on cosine similarity measures.
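The final-query stage combines standard TF-IDF weighting with cosine-similarity ranking. As a hedged sketch only (the exact IDF/TF formula used by CLARIT is not stated in this appendix, so a common log-IDF variant is assumed), the mechanics can be illustrated as:

```python
# Illustrative sketch: TF-IDF vectors for a document set, and cosine-
# similarity ranking of those documents against a long query vector.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency
    idf = {t: math.log(n / df[t]) for t in df}  # assumed log-IDF form
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs, k=200):
    """Return the indices of the k documents most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]
```

Here k would be 200, matching the final ranked set per topic described above.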
D. Automatically built queries (routing)
1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <def>
2. total computer time to build query (cpu seconds) 0.03 cpu seconds
3. which of the following were used in building the query?
a. terms selected from
(1) topic
(3) only documents with relevance judgments
b. term weighting
(1) with weights based on terms in topics
Yes. Topic terms were initially hand weighted.
c. phrase extraction
(1) from topics
(3) from documents with relevance judgments
d. syntactic parsing
(1) of topics
(2) of all training documents
(3) of documents with relevance judgments
g. tokenizer (recognizes dates, phone numbers, common patterns)
(1) which patterns are tokenized?
Only simple acronyms such as "I.B.M." were automatically
recognized as a unit.
k. other (brief description)
The routing queries were formed in two stages.
The first stage was the construction of a routing/partitioning
thesaurus.
The routing/partitioning thesaurus was generated by CLARIT from
the supplied list of relevant documents per topic. The text of the
topic fields was parsed and added to the pseudo-thesaurus derived
from the relevant documents. (Each pseudo-thesaurus consists of
automatically chosen noun phrases scoring above a certain
threshold, when scored for rarity, distribution, and frequency in the
relevant document set.) Partial noun phrases, derived from