Document Understanding Conferences

DUC 2001 GUIDELINES

To further progress in summarization and enable researchers to participate in large-scale experiments, the National Institute of Standards and Technology (NIST) began a new evaluation series in the area of text summarization, tentatively called the Document Understanding Conference (DUC). The basic design for the evaluation followed ideas in a recent summarization road map that was created by a committee of researchers in summarization, headed by Daniel Marcu. Plans called for the creation of reference data (documents and summaries) for training and testing. The training data were distributed in early March of 2001. The test data were distributed in mid-June, and results were due for evaluation on the first of July 2001. A workshop was held in September to discuss these results and to make plans for further evaluations.

Call for participation in DUC 2001

Here was the invitation to participate.

Information for active participants

This information was for individuals and groups who responded to the call for participation (see above), applied to participate in DUC 2001, and received the active participants' userid/password. Included were various forms which active participants needed to fill out, sign, and fax/mail to NIST.

Schedule

1. Mar	NIST sent out 30 training sets
1. May	Guidelines and evaluation plans complete
15. Jun	NIST sent out 30 test sets
1. Jul	results submitted to NIST; evaluation started
1. Aug	NIST sent evaluated results to participants
25. Aug	notebook papers for DUC 2001 conference were due
13-14. Sep	DUC 2001 workshop after SIGIR 2001 paper sessions in New Orleans

Data

NIST produced 60 reference sets, 30 for training and 30 for testing. Each set contained documents, per-document summaries, and multi-document summaries, with sets defined by different types of criteria such as event sets, opinion sets, etc. The data were password-protected. The password was provided by NIST to participants who had the required forms (see above) on file at NIST.

Tasks

There were three tasks defined as follows.

Fully automatic summarization of a single newswire/newspaper document (article):

Given such a document, a generic summary of the document with a length of approximately 100 words (whitespace-delimited tokens) was created.
Thirty sets of approximately 10 documents each were provided as system input for this task. Once the test data was received, no manual modification or augmentation of the test data or the test system was allowed.

Fully automatic summarization of multiple newswire/newspaper documents (articles) on a single subject:

Given a set of such documents, 4 generic summaries of the entire set with lengths of approximately 400, 200, 100, and 50 words (whitespace-delimited tokens) were created.
Thirty document sets of approximately 10 documents each were provided as system input for this task. Once the test data was received, no manual modification or augmentation of the test data or the test system was allowed.

Exploratory summarization:

Investigate alternative problems within summarization, novel approaches to their solution, and/or specialized evaluation strategies.
Participating researchers could use the data from the other two tasks and/or provide their own. Any deviation from fully automatic approaches had to be clearly described.
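
The length targets for the first two tasks are stated in whitespace-delimited tokens. As a minimal sketch of what that unit means in practice, the following Python counts tokens by splitting on whitespace. The function names are hypothetical, and the hard truncation policy is an assumption made for illustration: the guidelines only ask for "approximately" the target length and do not say how systems should meet it.

```python
# Target lengths taken from the task descriptions above.
SINGLE_DOC_TARGET = 100
MULTI_DOC_TARGETS = (400, 200, 100, 50)


def count_tokens(summary: str) -> int:
    """Count whitespace-delimited tokens, the unit named in the guidelines."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return len(summary.split())


def truncate_to_target(summary: str, target: int) -> str:
    """Keep at most `target` tokens (assumed policy, not from the guidelines)."""
    return " ".join(summary.split()[:target])
```

A system could call `count_tokens` on each output to check how far it falls from the target before submission.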

Submissions and Evaluation

For single document summaries there were 2 categories of evaluation: that done by humans (mostly at NIST), and that done automatically (outside of NIST). For multi-document summarization, the plan was only to have human evaluation.

Automatic evaluation employed techniques noted in the road map, with implementation and execution of these the responsibility of interested participants. NIST defined a standardized format for submissions from summarization systems. Here is a DTD for the format and a little example.

The entire set of summaries generated by a given system was to be submitted as simple ASCII text in one email to [email protected] using the format defined above. Within that submission, summaries were to be ordered by ascending docset id. Within a docset, the multi-document summaries came first (the 50-, then the 100-, 200-, and finally the 400-word summary), followed by the single-document summaries in ascending order of document number.
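
The ordering rule in the paragraph above can be expressed as a sort key. The sketch below is illustrative only: the `Summary` record, its field names, and the example ids are hypothetical and are not part of the NIST submission format. It also assumes that docset ids compare correctly as strings and that document numbers are integers.

```python
from typing import List, NamedTuple, Optional, Tuple


class Summary(NamedTuple):
    docset_id: str             # hypothetical id, e.g. "d01"
    size: Optional[int]        # 50/100/200/400 for multi-document, else None
    doc_number: Optional[int]  # set only for single-document summaries


def submission_key(s: Summary) -> Tuple[str, int, int, int]:
    """Sort key: docset id, then multi-document before single-document."""
    if s.doc_number is None:
        # Multi-document summary: ordered by ascending size (50 ... 400).
        return (s.docset_id, 0, s.size or 0, 0)
    # Single-document summary: ordered by ascending document number.
    return (s.docset_id, 1, 0, s.doc_number)


def order_for_submission(summaries: List[Summary]) -> List[Summary]:
    """Return the summaries in the order described for the submission."""
    return sorted(summaries, key=submission_key)
```

Encoding the rule as a tuple keeps each level of the ordering (docset, summary type, then size or document number) in its own position, so the comparison never mixes sizes with document numbers.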

Human evaluation was done at NIST using the same personnel who created the reference data. These people did pairwise comparisons of the reference summaries to the system-generated summaries, other reference summaries, and baseline summaries.

NIST used a modified version of the SEE software from ISI to support the human evaluation protocol. An unmodified version of SEE is available now.

NIST returned to each participating group the raw comparison data from SEE for their submitted summaries by 1. August. NIST also made all submitted summaries available to all participating groups around mid-July.

For participants - the raw evaluation results, submissions, models, baselines, and supporting information.

At the workshop NIST presented an overview of the results. Because this was the first use of the DUC-2001 evaluation procedure, the emphasis was on low-level diagnostic information rather than on a single score per system. NIST also provided data on human performance such as consistency of judgments and upper bounds of performance. This included doing multiple judgments on some subset of the results, judging reference summaries against other human-generated summaries, and in general investigating sources of variability.

For data, past results or other general information
contact: Nicole Baten (nicole DOT baten AT nist.gov)
For other questions contact: Paul Over ([email protected])
Date created: Friday, 26-July-02