NIST Special Publication 500-215 (SP500215): The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
D. Harman, National Institute of Standards and Technology
D. K. Harman

* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* FR -- Federal Register (1989)
* DOE -- Short abstracts from DOE publications

Disk 2
* WSJ -- Wall Street Journal (1990, 1991, 1992)
* AP -- AP Newswire (1988)
* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* FR -- Federal Register (1988)

Disk 3
* SJMN -- San Jose Mercury News (1991)
* AP -- AP Newswire (1990)
* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* PAT -- U.S. Patents (1993)

The documents are uniformly formatted into an SGML-like structure, as can be seen in the following example.

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications equipment markets.

AT&T said it is the first national long-distance carrier to announce prices for specific services under a world-wide standardization plan to upgrade phone networks. By announcing commercial services under the plan, which the industry calls the Integrated Services Digital Network, AT&T will influence evolving communications standards to its advantage, consultants said, just as International Business Machines Corp. has created de facto computer standards favoring its products.
</TEXT>
</DOC>

All documents have beginning and end markers, and a unique DOCNO id field. Additionally, other fields taken from the initial data appear, but these vary widely across the different sources. The documents have differing amounts of errors, which were not checked or corrected. Not only would this have been an impossible task, but the errors in the data provide a better simulation of the TREC task.
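The SGML-like layout described above can be read with a few lines of code. The following is only a minimal sketch, not the tooling used for TREC: the names `parse_docs`, `DOC_RE`, and `FIELD_RE` are hypothetical, the tag set is assumed from the single sample document, and fields are assumed not to nest.

```python
import re
from typing import Dict, Iterator

# Hypothetical patterns; tag names are assumed from the sample document only.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def parse_docs(raw: str) -> Iterator[Dict[str, str]]:
    """Yield one dict of tag -> text for each <DOC>...</DOC> block."""
    for doc_match in DOC_RE.finditer(raw):
        fields = {}
        for tag, body in FIELD_RE.findall(doc_match.group(1)):
            # Collapse runs of whitespace so wrapped lines become one string.
            fields[tag] = " ".join(body.split())
        # DOCNO is the one field every document is said to carry;
        # skip blocks where it is missing rather than guess an id.
        if "DOCNO" in fields:
            yield fields


sample = """
<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone services.
</TEXT>
</DOC>
"""

docs = list(parse_docs(sample))
print(docs[0]["DOCNO"])   # WSJ880406-0090
print(sorted(docs[0]))    # ['AUTHOR', 'DATELINE', 'DOCNO', 'HL', 'TEXT']
```

Regular expressions are used here instead of a strict XML parser because the data contains bare characters such as the "&" in "AT&T", which well-formed XML does not allow. Since fields other than DOCNO vary by source, the sketch collects whatever tagged fields it finds rather than assuming a fixed schema.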
Errors in missing document separators or bad document numbers were screened out, although a few were missed and later reported as errors.

Table 1 shows some basic document collection statistics. Note that although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths from very short documents (DOE) to very long (FR). Also the range of document lengths within a collection varies. For example, the documents from AP are similar in length (the median and the average length are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the Federal Register (FR) have a very wide range of lengths.

3.3 The Topics

In designing the TREC task, there was a conscious decision made to provide "user need" statements rather than more traditional queries. Two major issues were involved in this decision. First, there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant.

The topics were designed to mimic a real user's need, and were written by people who are actual users of a retrieval system. Although the subject domain of the topics was diverse, some consideration was given to the documents to be searched. The topics were constructed by doing trial retrievals against a sample of the document set, and then those topics that had roughly 25 to 100 hits in that sample were used. This created a range of broader and narrower topics.

The following is one of the topics used in TREC.
<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description:
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.