NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology

The documents were uniformly formatted into an SGML-like structure, as can be seen in the following example.

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications equipment markets. AT&T said it is the first national long-distance carrier to announce prices for specific services under a world-wide standardization plan to upgrade phone networks. By announcing commercial services under the plan, which the industry calls the Integrated Services Digital Network, AT&T will influence evolving communications standards to its advantage, consultants said, just as International Business Machines Corp. has created de facto computer standards favoring its products.
</TEXT>
</DOC>

All documents had beginning and end markers, and a unique DOCNO id field. Additionally, other fields taken from the initial data appeared, but these varied widely across the different sources. The documents also had different amounts of errors, which were not checked or corrected. Not only would this have been an impossible task, but the errors in the data provided a better simulation of the real-world task. Errors such as missing document separators or bad document numbers were screened out, although a few were missed and later reported by participants. Table 1 shows some basic document collection statistics.

TABLE 1.
DOCUMENT STATISTICS

Subset of collection                     WSJ       AP        ZIFF      FR        DOE
Size of collection (megabytes)
  (disk 1)                               295       266       251       258       190
  (disk 2)                               255       248       188       211
Number of records
  (disk 1)                               98,736    84,930    75,180    26,207    226,087
  (disk 2)                               74,520    79,923    56,920    20,108
Median number of terms per record
  (disk 1)                               182       353       181       313       82
  (disk 2)                               218       346       167       315
Average number of terms per record
  (disk 1)                               329       375       412       1017      89
  (disk 2)                               377       370       394       1073

Note that although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths from very short documents (DOE) to very long (FR). Also the range of document lengths within a collection varies. For example, the documents from AP are similar in length (the median and the average length