SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Overview of the First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology

TABLE 4. OVERLAP OF SUBMITTED RESULTS

                                            Top 200    Top 200    Top 100    Top 100
                                            Possible   Actual     Possible   Actual
Average number of unique documents
per topic (adhoc, 33 runs, 16 groups)       6600       2398A      3300       1278.86
Average number of unique documents
per topic (routing, 22 runs, 16 groups)     4400       1932.42    2200       1066.86

For example, out of a maximum of 6600 unique documents (33 runs times 200 documents), over one-third were actually unique. The top 100 documents retrieved contained about the same percentage of unique documents. This means that the different systems were finding different documents as likely relevant documents for a topic. Whereas this might be expected from widely differing systems (and indeed has been shown to occur; Katzer et al. 1982), these overlaps were often between two runs for a given system, or between two systems run on the same basic retrieval engine. One reason for the lack of overlap is the very large number of documents that contain many of the same keywords as the relevant documents, but probably a larger reason is the very different sets of keywords in the constructed queries (this needs further analysis). This lack of overlap should improve the coverage of the relevance set, and verifies the use of the pooling methodology to produce the sample.

The merged list of results was then shown to the human assessors. Only the top 100 documents were judged, resulting in an average of 1462.24 documents judged for each topic, ranging from a high of 2893 for topic 74 to a low of 611 for topic 46. Each topic was judged by a single assessor to ensure the best consistency of judgment, and varying numbers of documents were judged relevant to the topics.
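The pooling computation described above can be sketched briefly: for each topic, take the union of each run's top-k retrieved documents, and compare the pool's actual size against the possible maximum (number of runs times k). The run and document identifiers below are made-up illustrations, not TREC data.

```python
# Sketch of the pool-overlap computation: unique documents in the
# union of each run's top-`depth` retrieved list for one topic.
# Document IDs here are hypothetical, not actual TREC collection IDs.

def unique_pool_size(runs, depth):
    """Count unique documents across the top-`depth` lists of all runs."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return len(pool)

# Three toy runs for a single topic, each a ranked list of doc IDs.
run_a = ["WSJ-001", "WSJ-002", "AP-010", "FR-005"]
run_b = ["WSJ-002", "AP-011", "AP-010", "FR-009"]
run_c = ["ZF-100", "WSJ-001", "AP-010", "FR-005"]

runs = [run_a, run_b, run_c]
maximum = len(runs) * 2              # 3 runs x depth 2 = 6 possible
actual = unique_pool_size(runs, 2)   # overlap shrinks the pool to 4
print(maximum, actual)               # prints: 6 4
```

A low `actual`-to-`maximum` ratio would indicate heavy overlap between runs; the roughly one-third ratio reported in Table 4 shows the opposite, which is what makes the pooled sample a broad relevance set.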
Figure 3 shows the number of documents judged relevant for each of the 100 topics. The topics are sorted by the number of relevant documents to better show their range and median.

[Figure 3: bar chart of the number of relevant documents per topic; the vertical axis runs from 0 to 1000.]

Figure 3. Number of Relevant Documents on a Per Topic Basis.
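The sorted presentation in Figure 3 amounts to ordering the per-topic relevant counts and reading off their range and median. A minimal sketch, using invented counts (the real per-topic values appear only in the figure):

```python
# Hypothetical per-topic relevant-document counts; the actual TREC-1
# values are shown graphically in Figure 3 and are not reproduced here.
from statistics import median

relevant_counts = {51: 34, 52: 210, 53: 581, 54: 9, 55: 120}

# Sort descending by count, as Figure 3 does, then report range and median.
ordered = sorted(relevant_counts.values(), reverse=True)
print(ordered)                                       # [581, 210, 120, 34, 9]
print(min(ordered), max(ordered), median(ordered))   # 9 581 120
```

Sorting by count (rather than by topic number) makes the spread visible at a glance, which is why the figure uses that ordering.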