SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
chapter
D. Harman
National Institute of Standards and Technology
D. K. Harman
* ZIFF -- Articles from Computer Select disks
  (Ziff-Davis Publishing)
* FR -- Federal Register (1989)
* DOE -- Short abstracts from DOE publications
Disk 2
* WSJ -- Wall Street Journal (1990, 1991, 1992)
* AP -- AP Newswire (1988)
* ZIFF -- Articles from Computer Select disks
  (Ziff-Davis Publishing)
* FR -- Federal Register (1988)
Disk 3
* SJMN -- San Jose Mercury News (1991)
* AP -- AP Newswire (1990)
* ZIFF -- Articles from Computer Select disks
  (Ziff-Davis Publishing)
* PAT -- U.S. Patents (1993)
The documents are uniformly formatted into an SGML-like
structure, as can be seen in the following example.
<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks
Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the
first of a new generation of phone services with broad
implications for computer and communications equipment
markets.
AT&T said it is the first national long-distance
carrier to announce prices for specific services under a
world-wide standardization plan to upgrade phone
networks. By announcing commercial services under the
plan, which the industry calls the Integrated Services
Digital Network, AT&T will influence evolving
communications standards to its advantage, consultants said,
just as International Business Machines Corp. has
created de facto computer standards favoring its products.
</TEXT>
</DOC>
All documents have beginning and end markers, and a
unique DOCNO id field. Additionally, other fields taken
from the initial data appear, but these vary widely across
the different sources. The documents have differing
amounts of errors, which were not checked or corrected.
Not only would this have been an impossible task, but the
errors in the data provide a better simulation of the TREC
task. Errors such as missing document separators or bad
document numbers were screened out, although a few were
missed and later reported as errors.
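The kind of screening mentioned here can be illustrated with a short sketch. It is not the procedure that was actually used to build the collection. It assumes only that each document sits between `<DOC>` and `</DOC>` markers and carries exactly one `<DOCNO>` field, as in the example document. The function name `split_documents`, the rejection rules, and the sample records (including the `X-0001` identifiers) are hypothetical.

```python
import re

# Non-greedy match so each <DOC> ... </DOC> pair is captured separately.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")


def split_documents(raw):
    """Return (good, rejected).

    good maps DOCNO -> document body; rejected lists bodies that look
    malformed under the assumed rules below.
    """
    good, rejected = {}, []
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docnos = DOCNO_RE.findall(body)
        # A missing </DOC> separator makes two records run together,
        # which shows up as a second <DOC> marker inside the body.
        runs_together = "<DOC>" in body
        # Also reject a missing, repeated, or duplicated document number.
        if runs_together or len(docnos) != 1 or docnos[0] in good:
            rejected.append(body)
        else:
            good[docnos[0]] = body
    return good, rejected


# Hypothetical input: one clean record, one without a DOCNO, and two
# records fused together by a missing </DOC>.
sample = (
    "<DOC>\n<DOCNO> WSJ880406-0090 </DOCNO>\n<TEXT>\nfirst\n</TEXT>\n</DOC>\n"
    "<DOC>\n<TEXT>\nno identifier\n</TEXT>\n</DOC>\n"
    "<DOC>\n<DOCNO> X-0001 </DOCNO>\n<TEXT>\nunterminated\n"
    "<DOC>\n<DOCNO> X-0002 </DOCNO>\n<TEXT>\nsecond\n</TEXT>\n</DOC>\n"
)
good, rejected = split_documents(sample)
```

Note that a fused pair is rejected as a whole in this sketch, so the well-formed second record is lost along with the broken one. A more careful screen would try to recover it.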
Table 1 shows some basic document collection statistics.
Note that although the collection sizes are roughly
equivalent in megabytes, there is a range of document lengths
from very short documents (DOE) to very long (FR).
The range of document lengths within a collection also
varies. For example, the documents from AP are similar
in length (the median and the average length are very
close), but the WSJ and ZIFF documents have a wider
range of lengths. The documents from the Federal
Register (FR) have a very wide range of lengths.
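The comparison of median and average length can be made concrete with a small sketch. It counts length in whitespace-separated words, which is an assumption and may differ from the unit used in Table 1. The function name `length_summary` and the toy collections are hypothetical and are not taken from the TREC data.

```python
from statistics import mean, median


def length_summary(docs):
    """Summarize document lengths for a non-empty list of document texts.

    Length is the number of whitespace-separated words (an assumed unit).
    """
    lengths = [len(text.split()) for text in docs]
    return {
        "docs": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


def words(n):
    """Build a toy document of n words."""
    return " ".join(["w"] * n)


# Lengths 3, 4, 5: mean and median coincide, as described for AP.
uniform = length_summary([words(3), words(4), words(5)])

# Lengths 1, 2, 3, 94: one very long document pulls the mean far above
# the median, the pattern described for collections with a wide range.
skewed = length_summary([words(1), words(2), words(3), words(94)])
```

When the mean sits well above the median, a few very long documents dominate the collection, which is the situation the text describes for FR.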
3.3 The Topics
In designing the TREC task, there was a conscious
decision made to provide "user need" statements rather than
more traditional queries. Two major issues were involved
in this decision. First, there was a desire to allow a wide
range of query construction methods by keeping the topic
(the need statement) distinct from the query (the actual
text submitted to the system). The second issue was the
ability to increase the amount of information available
about each topic, in particular to include with each topic a
clear statement of what criteria make a document relevant.
The topics were designed to mimic a real user's need, and
were written by people who are actual users of a retrieval
system. Although the subject domain of the topics was
diverse, some consideration was given to the documents
to be searched. The topics were constructed by doing trial
retrievals against a sample of the document set, and then
those topics that had roughly 25 to 100 hits in that sample
were used. This created a range of broader and narrower
topics.
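The selection step can be sketched as a simple filter over trial-retrieval hit counts. This is an illustration only: the text says "roughly 25 to 100 hits", so the hard cutoffs used here are an assumption, and the function name `select_topics`, the candidate labels, and the counts are hypothetical.

```python
def select_topics(hit_counts, low=25, high=100):
    """Keep candidate topics whose hit count lies in [low, high].

    hit_counts maps a candidate topic label to the number of hits it
    produced in a trial retrieval against the document sample.
    """
    return [topic for topic, hits in hit_counts.items() if low <= hits <= high]


# Hypothetical trial-retrieval results for five candidate topics.
candidates = {"A": 10, "B": 25, "C": 60, "D": 100, "E": 340}
kept = select_topics(candidates)
```

In this toy run the candidate with too few hits and the one with too many are dropped, and the three in between are kept. The kept topics still span a range of hit counts, which corresponds to the mix of narrower and broader topics the text mentions.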
The following is one of the topics used in TREC.
<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description:
Document will identify a type of natural language
processing technology which is being developed or
marketed in the U.S.