NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology

are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the Federal Register (FR) have a very wide range of lengths.

The distribution of terms in these subsets shows interesting variations. Table 2 shows some term distribution statistics found using a small stopword list of 25 terms and no stemming. For example, the AP has more unique terms than the others, probably reflecting both more proper names and more spelling errors. The DOE collection, while very small, is highly technical and covers many domains, resulting in many specific technical terms.

TABLE 2. DICTIONARY STATISTICS

Subset of collection                      WSJ       AP     ZIFF       FR      DOE

Total number of unique terms
  (disk 1)                            156,298  197,608  173,501  126,258  186,225
  (disk 2)                            153,725  186,500  147,405  116,586

Occurring once
  (disk 1)                             64,656   89,627   85,992   58,677   95,782
  (disk 2)                             64,844   83,019   72,053   54,823

Occurring > 1
  (disk 1)                             91,642  107,981   87,509   67,581   90,443
  (disk 2)                             88,881  103,481   75,352   61,763

Average number of occurrences > 1
  (disk 1)                                199      174      165      106      159
  (disk 2)                                178      169      139       91

How does this document set compare with the older collections? Table 3 shows a comparison of these collections with the Cranfield 1400 collection mentioned earlier. Not only has the size of the document collection increased by a factor of about 200, but the average length of the documents has at least doubled, and in some cases (FR), increased by a factor of 10. Also, the dictionary sizes have increased by a factor of 20.

TABLE 3.
COMPARISON TO OLDER COLLECTIONS

Subset of collection                      WSJ       AP     ZIFF       FR  Cranfield

Size of collection (megabytes)            295      266      251      258        1.5
Number of records                      98,736   84,930   75,180   26,207      1,400
Median number of terms per record         182      353      181      313         79
Average number of terms per record        329      375      412    1,017         88
Total number of unique terms          156,298  197,608  173,501  126,258      8,226

What does this mean to the TREC task? First, a major portion of the effort for TREC-1 was spent on the system engineering necessary to handle the huge number of documents. This means that little time was left for system tuning or experimental runs, and therefore the TREC-1 results can best be viewed as a baseline for later research.

The longer documents also required major adjustments to the algorithms themselves (or caused a loss of performance). This is particularly true for the very long documents in FR. Since a relevant document might contain only one or two relevant sentences, many algorithms needed adjustment after having worked only with the abstract-length documents found in the old collections. Additionally, many documents were composite stories covering different topics, and this caused problems for most algorithms.
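The dictionary statistics reported in Tables 2 and 3 can be approximated with a short script. The following is a minimal sketch, not the procedure actually used for TREC-1: the tokenizer (lowercasing and splitting on non-alphanumeric characters), the function names, and the stopword list are all assumptions made for illustration, since the text specifies only "a small stopword list of 25 terms and no stemming". It is also assumed here that "terms per record" counts every non-stopword token in a record.

```python
import re
from collections import Counter
from statistics import mean, median

# Hypothetical stand-in for the 25-term stopword list; the actual list is not given.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"})


def tokenize(text, stopwords=STOPWORDS):
    """Lowercase, split on non-alphanumeric characters, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]


def dictionary_stats(documents, stopwords=STOPWORDS):
    """Compute Table 2 / Table 3 style statistics for an iterable of document strings."""
    counts = Counter()   # collection-wide frequency of each term
    doc_lengths = []     # number of (non-stopword) terms in each record
    for doc in documents:
        tokens = tokenize(doc, stopwords)
        counts.update(tokens)
        doc_lengths.append(len(tokens))

    once = sum(1 for c in counts.values() if c == 1)
    repeated = [c for c in counts.values() if c > 1]
    return {
        "unique_terms": len(counts),
        "occurring_once": once,
        "occurring_more_than_once": len(repeated),
        # Average frequency, taken only over terms that occur more than once.
        "avg_occurrences_of_repeated": mean(repeated) if repeated else 0.0,
        "median_terms_per_record": median(doc_lengths) if doc_lengths else 0,
        "avg_terms_per_record": mean(doc_lengths) if doc_lengths else 0.0,
    }


if __name__ == "__main__":
    sample = [
        "The retrieval of text documents",
        "Text retrieval systems index documents",
        "A baseline for later research",
    ]
    for name, value in dictionary_stats(sample).items():
        print(f"{name}: {value}")
```

A useful sanity check on any such computation is that the terms occurring once and the terms occurring more than once must sum to the total number of unique terms. The figures in Table 2 satisfy this; for WSJ on disk 1, 64,656 + 91,642 = 156,298.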