TREC Terabyte Track

Track coordinators: Charles Clarke (University of Waterloo), Falk Scholer (RMIT), Ian Soboroff (NIST)

Introduction

Early retrieval test collections were small, allowing relevance judgments to be based on an exhaustive examination of the documents, but limiting the general applicability of the findings. Karen Sparck Jones and Keith van Rijsbergen proposed a way of building significantly larger test collections by using pooling, a procedure adopted and subsequently validated by TREC. Now TREC-sized collections (several gigabytes of text and a few million documents) are small for some realistic tasks, but current pooling practices do not scale to substantially larger document sets. The goal of this track is to develop an evaluation methodology for terabyte-scale document collections.
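As an illustration of the pooling idea mentioned above, the following minimal sketch forms a judgment pool for a single topic by taking the union of the top-ranked documents from each submitted run. The function name, the run names, the document identifiers, and the pool depth are all hypothetical and chosen only for this example; it is not a description of the actual tooling used at TREC.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Return the union of the top-`depth` documents of every run.

    `runs` maps a (hypothetical) run name to its ranked list of
    document identifiers for one topic, best document first.
    """
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Toy rankings for one topic (invented data).
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
    "run_c": ["d7", "d1", "d2", "d8"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Only documents in the pool are shown to assessors; everything outside it remains unjudged. As the collection grows while the pool depth stays fixed, the judged fraction of the collection shrinks, which is the scaling problem described above.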

At SIGIR 2003, Ian Soboroff, Ellen Voorhees, and Nick Craswell organized a workshop on this topic. The workshop's goal was to produce a TREC track proposal for a retrieval experiment using a document collection on the order of a terabyte in size. This web site contains information about the workshop and the proposed track.

The goals of the track are twofold. First, we expect that retrieval algorithms may perform differently at very large scales. In the early TREC collections, for example, document length normalization was much more important than it had been in the smaller collections that preceded them, such as Cranfield.

Second, we expect that evaluation methodologies will need to be revised to deal more effectively with incomplete relevance information. In the current TREC collections, relevance judgments are incomplete but the topics are still reusable. With a terabyte or more of data, the chances are much higher that new systems will discover relevant documents which have not been judged at all. New measures and new methods of performing relevance judgments may be needed.
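To make the concern about unjudged documents concrete, here is a small sketch, under stated assumptions, of how a new system's score can be affected. It follows the common convention of counting unjudged documents as non-relevant when computing precision, and separately reports the fraction of retrieved documents that were never judged. The function names, judgments, and ranking are invented for illustration and are not taken from any track data.

```python
from typing import Dict, List


def precision_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Precision at rank k, counting unjudged documents as non-relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / k


def unjudged_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the top-k documents that have no judgment at all."""
    top = ranking[:k]
    return sum(1 for doc in top if doc not in qrels) / k


# Hypothetical judgments for one topic: 1 = relevant, 0 = not relevant.
qrels = {"d1": 1, "d2": 0, "d3": 1}

# A new system that did not contribute to the pool retrieves d8 and d9,
# which were never judged.
new_run = ["d9", "d1", "d8", "d3"]

print(precision_at_k(new_run, qrels, 4))  # 0.5
print(unjudged_at_k(new_run, qrels, 4))   # 0.5
```

If d8 and d9 were in fact relevant, the measured precision of 0.5 would understate the new system's true precision of 1.0. A high unjudged fraction is one simple signal that a score computed this way may be unreliable.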

Tasks

The main task for the terabyte track is ad hoc informational search. While not a web-centric task, ad hoc search has the advantage of being well-studied and of providing a basis of comparison to smaller collections. This certainly does not preclude participants from using link structure, anchor text, or other "web data" in their search processes.

In contrast to the TREC newswire collections, it is assumed that relevance information will be incomplete, perhaps catastrophically so for some topics. This is not by design, but rather a side effect of the pooling process. The goal of the track is to investigate this phenomenon, its effects, and possible solutions.

Data

The track is currently using a collection of Web data crawled from Web sites in the .gov domain during early 2004. We believe that this collection ("GOV2") contains a large proportion of the crawlable pages in .gov, including HTML and text, plus the extracted text of PDF, Word, and PostScript files. By focusing the track on a single, large, interconnected domain we hoped to create a realistic setting, where content, structure, and links could all be fruitfully exploited in the retrieval process.

The GOV2 collection is 426GB in size and contains 25 million documents. While this collection contains less than a full terabyte of data, it is considerably larger than the collections used in previous TREC tracks. For TREC 2004, the collection was distributed by CSIRO on a single hard drive. For TREC 2005 and forward, this collection is available from the University of Glasgow.

Past track guidelines, topics, and relevance judgments are available below. Internet Explorer users should right-click or shift-click to download the topics files, instead of viewing them in the browser.

TREC 2004:	Guidelines Topics Relevance judgments
TREC 2005:	Guidelines Adhoc topics Relevance judgments Efficiency topics Mapping to adhoc topics Named page topics Named page judgments

Other data files are available from the TREC data archive.

Overviews and other references

TREC 2004 Terabyte Track overview

A summary of the SIGIR 2003 workshop appeared in the Fall 2003 issue of SIGIR Forum.

Mailing List

The proto-track mailing list is at trec-tb @ nist . gov. To subscribe, send email as follows:

To: listproc @ nist . gov 
Subject: (leave empty)

subscribe trec-tb your-full-name 
(Email addresses above are lightly munged to confuse spammers. To use them, remove the spaces around the dot and at-sign.)

Once you subscribe to the mailing list, the list archives will catch you up on what you've missed. The list is password protected; you'll receive the password via email after you subscribe.

Contact: Ian Soboroff
Last updated: Wednesday, 22-Feb-2006 08:31:45 MST
Date created: 27-Aug-03 09:15:00