NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Combining Evidence from Multiple Searches
E. Fox, M. Koushik, J. Shaw, R. Modlin, D. Rao
National Institute of Standards and Technology
Donna K. Harman

of the SMART Information Retrieval System [1]. The following specifications were used during the indexing process:

1. No stemming was done.
2. A stop word list of 418 words was used.
3. The Heading, Text, and Summary sections were included.
4. A controlled vocabulary was not included.

Briefly, the document text is tokenized, stop words are deleted, and non-noise words are included in the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file is also created during indexing which contains for each document its unique ID, and a vector of term IDs and term weights. The weighting scheme itself is fairly flexible and can be changed to one of several schemes after the indexing is complete.

Indexing the WSJ created a dictionary of approximately 15 MB and a document vector file of 121 MB. The other 4 collections take up space proportional to their sizes.

3 Retrieval Approach

3.1 Retrieval Runs

Several retrieval runs were then made as outlined below:

* Retrieval based on the vector model

  The topics were indexed considering the Description, Narrative, and Concepts sections to form vector queries. Retrieval was performed by varying the weighting on the document and query vectors. Three different methods were tried:

  1. Weighting with tf
  2. Weighting with (tf/max(tf)) * idf
  3. Weighting with (0.5 + 0.5 * tf/max(tf)) * idf, called aug_norm

  where tf = term frequency, and idf = inverse document frequency. During retrieval, the query-document similarities were computed in two different ways for each of these weighting schemes: cosine and inner product.
* Retrieval based on the Boolean model

  Boolean queries were manually generated by team members (with a background in computer science), and a retrieval run was made using these queries. The queries were composed, for the most part, using the entire text of the topic descriptions provided, and occasionally broader/narrower terms were obtained from general domain knowledge. The Boolean operators used were AND and OR.
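
The three vector-model weighting schemes and the two similarity measures listed in Section 3.1 can be illustrated with a short Python sketch. It is not the SMART implementation. The function names, the scheme labels `tf` and `norm_idf`, the toy collection, and in particular the formula idf = log(N/df) are assumptions made for illustration, since the text does not state which idf definition was used.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df). This idf formula is an assumption."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def weight(tokens, idf, scheme):
    """Return a sparse vector {term: weight} under one of the three schemes."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    if scheme == "tf":            # 1. raw term frequency
        return {t: float(f) for t, f in tf.items()}
    if scheme == "norm_idf":      # 2. (tf / max(tf)) * idf
        return {t: (f / max_tf) * idf.get(t, 0.0) for t, f in tf.items()}
    if scheme == "aug_norm":      # 3. (0.5 + 0.5 * tf / max(tf)) * idf
        return {t: (0.5 + 0.5 * f / max_tf) * idf.get(t, 0.0)
                for t, f in tf.items()}
    raise ValueError(f"unknown weighting scheme: {scheme}")


def inner_product(q, d):
    """Sum of products of weights over the terms shared by q and d."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def cosine(q, d):
    """Inner product divided by the product of the two vector lengths."""
    q_len = math.sqrt(sum(w * w for w in q.values()))
    d_len = math.sqrt(sum(w * w for w in d.values()))
    if q_len == 0.0 or d_len == 0.0:
        return 0.0
    return inner_product(q, d) / (q_len * d_len)


# Hypothetical toy collection: each document is a list of tokens that
# have already been tokenized and had stop words removed.
DOCS = [
    ["stock", "market", "stock"],
    ["bond", "market"],
    ["stock", "bond", "yield"],
]
IDF = idf_table(DOCS)
QUERY = ["stock", "yield"]


def rank(scheme, similarity):
    """Return document indices sorted by decreasing similarity to QUERY."""
    q_vec = weight(QUERY, IDF, scheme)
    scores = [(similarity(q_vec, weight(doc, IDF, scheme)), i)
              for i, doc in enumerate(DOCS)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Calling `rank` with each of the three scheme labels and with either `cosine` or `inner_product` yields the six weighting and similarity combinations described for the vector runs. Inner product favors documents with larger weights overall, while cosine removes the effect of vector length.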