SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology
Donna K. Harman
are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the
Federal Register (FR) have a very wide range of lengths.
The distribution of terms in these subsets shows interesting variations. Table 2 gives some term distribution
statistics found using a small stopword list of 25 terms and no stemming. For example, the AP subset has more unique
terms than the others, probably reflecting both more proper names and more spelling errors. The DOE collection,
while very small, is highly technical and covers many domains, resulting in many specific technical terms.
TABLE 2. DICTIONARY STATISTICS

Subset of collection                 WSJ        AP      ZIFF        FR       DOE
Total number of unique terms
  (disk 1)                       156,298   197,608   173,501   126,258   186,225
  (disk 2)                       153,725   186,500   147,405   116,586        --
Occurring once
  (disk 1)                        64,656    89,627    85,992    58,677    95,782
  (disk 2)                        64,844    83,019    72,053    54,823        --
Occurring > 1
  (disk 1)                        91,642   107,981    87,509    67,581    90,443
  (disk 2)                        88,881   103,481    75,352    61,763        --
Average number of occurrences > 1
  (disk 1)                           199       174       165       106       159
  (disk 2)                           178       169       139        91        --
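The statistics in Table 2 can be approximated with a short script. The following is a minimal sketch, not the procedure actually used for TREC-1: the tokenizer, the 25-word stopword list, and the function names are all hypothetical (the text does not give the actual list), and it assumes that "average number of occurrences > 1" means the mean collection frequency of terms that occur more than once.

```python
import re
from collections import Counter

# Hypothetical 25-term stopword list; the actual list is not given in the text.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "in", "is", "it", "of", "on", "or", "that", "the", "this",
    "to", "was", "were", "will", "with",
})


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords; no stemming."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def dictionary_stats(documents):
    """Compute Table 2 style dictionary statistics for an iterable of texts."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # Collection frequencies of terms that occur more than once (assumed reading).
    repeated = [c for c in counts.values() if c > 1]
    return {
        "unique_terms": len(counts),
        "occurring_once": sum(1 for c in counts.values() if c == 1),
        "occurring_more_than_once": len(repeated),
        "avg_occurrences_repeated": sum(repeated) / len(repeated) if repeated else 0.0,
    }


SAMPLE_STATS = dictionary_stats(["The cat sat on the mat", "A cat and a dog"])
```

By construction, the number of unique terms equals the terms occurring once plus the terms occurring more than once. The same identity holds in Table 2, for example 64,656 + 91,642 = 156,298 for WSJ on disk 1.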
How does this document set compare with the older collections? Table 3 shows a comparison of these collections
with the Cranfield 1400 collection mentioned earlier. Not only has the size of the document collection
increased by a factor of about 200, but the average length of the documents has at least doubled, and in some
cases (FR), increased by a factor of 10. Also, the dictionary sizes have increased by a factor of 20.
TABLE 3. COMPARISON TO OLDER COLLECTIONS

Subset of collection                       WSJ        AP      ZIFF        FR Cranfield
Size of collection (megabytes)             295       266       251       258       1.5
Number of records                       98,736    84,930    75,180    26,207     1,400
Median number of terms per record          182       353       181       313        79
Average number of terms per record         329       375       412     1,017        88
Total number of unique terms           156,298   197,608   173,501   126,258     8,226
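The growth factors quoted in the text can be checked directly against the Table 3 figures. The sketch below is illustrative only: it assumes the five columns of Table 3 are WSJ, AP, ZIFF, FR, and Cranfield (the first four unique-term counts match the disk 1 values in Table 2), and the names `TABLE_3`, `CRANFIELD`, and `growth_factors` are hypothetical.

```python
# Values copied from Table 3. The column labels are an assumption, inferred
# from the matching unique-term counts in Table 2 (disk 1).
TABLE_3 = {
    "WSJ":  {"megabytes": 295, "avg_terms_per_record": 329,  "unique_terms": 156_298},
    "AP":   {"megabytes": 266, "avg_terms_per_record": 375,  "unique_terms": 197_608},
    "ZIFF": {"megabytes": 251, "avg_terms_per_record": 412,  "unique_terms": 173_501},
    "FR":   {"megabytes": 258, "avg_terms_per_record": 1017, "unique_terms": 126_258},
}
CRANFIELD = {"megabytes": 1.5, "avg_terms_per_record": 88, "unique_terms": 8_226}


def growth_factors(subset, baseline=CRANFIELD):
    """Return the ratio of each subset statistic to the baseline statistic."""
    return {name: subset[name] / baseline[name] for name in baseline}


FACTORS = {label: growth_factors(stats) for label, stats in TABLE_3.items()}

if __name__ == "__main__":
    for label, factors in FACTORS.items():
        print(label, {name: round(value, 1) for name, value in factors.items()})
```

Under these assumptions the WSJ size ratio comes out near 197, the average record lengths are roughly 4 to 12 times the Cranfield average, and the dictionary-size ratios range from about 15 (FR) to about 24 (AP), which is consistent with the approximate factors given in the text.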
What does this mean for the TREC task? First, a major portion of the effort for TREC-1 was spent in the
system engineering necessary to handle the huge number of documents. This means that little time was left for
system tuning or experimental runs, and therefore the TREC-1 results can best be viewed as a baseline for later
research. The longer documents also required major adjustments to the algorithms themselves (or loss of performance).
This is particularly true for the very long documents in FR. Since a relevant document might contain
only one or two relevant sentences, many algorithms needed adjustment from working with the abstract-length
documents found in the old collections. Additionally, many documents were composite stories covering different
topics, and this caused problems for most algorithms.