DUC 2001 GUIDELINES
To further progress in summarization and enable researchers to
participate in large-scale experiments, the National Institute of
Standards and Technology began a new evaluation series in the
area of text summarization, tentatively called the Document
Understanding Conference (DUC). The basic design for the evaluation
followed ideas in a recent summarization road map that was created by a
committee of researchers in summarization, headed by Daniel Marcu.
Plans called for the creation of reference data (documents and
summaries) for training and testing. The training data were
distributed in early March of 2001. The test data were distributed
in mid-June and results were due for evaluation the first of July
2001. A workshop was held in September to discuss these results
and to make plans for further evaluations.
- Call for participation in DUC 2001
- Here was the invitation to participate.
- Information for active participants
- This information was for individuals
and groups who responded to the call for participation (see above),
applied to participate in DUC 2001, and received the active
participants' userid/password. Included were various forms which active
participants needed to fill out, sign, and fax/mail to NIST.
| 1 Mar | NIST sent out 30 training sets |
| 1 May | Guidelines and evaluation plans complete |
| mid-Jun | NIST sent out 30 test sets |
| 1 Jul | Results submitted to NIST; evaluation started |
| 1 Aug | NIST sent evaluated results to participants |
| | Notebook papers for DUC 2001 conference were due |
| Sep | DUC 2001 workshop after SIGIR 2001 paper sessions in New Orleans |
- NIST produced 60 reference sets, 30 for training and 30
for testing. Each set contained documents, per-document
summaries, and multi-document summaries, with sets defined
by different types of criteria such as event sets, opinion
sets, etc. The data were password-protected. The password was
provided by NIST to participants who had the required forms
(see above) on file at NIST.
There were three tasks defined as follows.
- Fully automatic summarization of a single newswire/newspaper
document (article):
- Given such a document, a generic summary of the document
with a length of approximately 100 words (whitespace-delimited
tokens) was created.
Thirty sets of approximately 10 documents each were provided
as system input for this task. Once the test data
were received, no manual modification or augmentation of the test
data or the test system was allowed.
- Fully automatic summarization of multiple newswire/newspaper
documents (articles) on a single subject:
- Given a set of such documents, 4 generic summaries of the
entire set with lengths of approximately 400, 200, 100, and 50 words
(whitespace-delimited tokens) were created.
Thirty document sets of approximately 10 documents each were
provided as system input for this task. Once the
test data were received, no manual modification or augmentation of
the test data or the test system was allowed.
- Exploratory summarization:
- Investigate alternative problems within summarization, novel
approaches to their solution, and/or specialized evaluation.
Participating researchers could use the data from the other two tasks
and/or provide their own. Any deviation from fully automatic
approaches had to be clearly described.
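The length targets in the first two tasks are stated in whitespace-delimited tokens. As a minimal sketch of that convention, the Python below counts and truncates by such tokens. The function names and the 10% tolerance are assumptions made for illustration; the guidelines say only "approximately" and do not define a tolerance.

```python
# Target lengths named in the task definitions (in whitespace-delimited tokens).
SINGLE_DOC_TARGET = 100
MULTI_DOC_TARGETS = (400, 200, 100, 50)


def count_words(summary: str) -> int:
    """Count whitespace-delimited tokens in a summary."""
    return len(summary.split())


def truncate_to_words(summary: str, limit: int) -> str:
    """Keep at most `limit` tokens.

    Note: tokens are rejoined with single spaces, so the original
    line breaks and spacing are not preserved.
    """
    return " ".join(summary.split()[:limit])


def within_tolerance(summary: str, target: int, tolerance: float = 0.1) -> bool:
    """Hypothetical check: the 10% band is an assumption, not part of the guidelines."""
    return abs(count_words(summary) - target) <= tolerance * target
```

A system could call `truncate_to_words(text, size)` for each size in `MULTI_DOC_TARGETS` as a last step before submission, although how close "approximately" had to be was left to the participants.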
- Submissions and Evaluation
For single document summaries there were 2 categories of
evaluation: that done by humans (mostly at NIST), and that done
automatically (outside of NIST). For multi-document
summarization, the plan was only to have human evaluation.
Automatic evaluation employed techniques noted in the road map,
with implementation and execution of these the responsibility of
interested participants. NIST defined a standardized format
for submissions from summarization systems. A DTD for the
format was provided.
The entire set of summaries generated by a given system
was to be submitted as simple ASCII text in one email to
Lori.Buckland@nist.gov using the format defined above. Within
that submission, summaries were to be ordered by ascending docset
id. Within a docset, the multi-document summaries were to come first
(the 50-, then 100-, 200-, and finally the 400-word summary), followed
by the single-document summaries in ascending order of document id.
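The ordering rule for a submission can be expressed as a sort key. The sketch below is one possible reading of it, not the NIST format itself: the `Summary` record, its field names, and the string comparison of ids are assumptions for illustration, and the DTD-defined markup is not reproduced here.

```python
from typing import List, NamedTuple, Optional, Tuple


class Summary(NamedTuple):
    """Hypothetical record for one summary in a submission."""
    docset_id: str
    kind: str                     # "multi" or "single"
    size: int                     # target length in words
    doc_id: Optional[str] = None  # set only for single-document summaries


def submission_key(s: Summary) -> Tuple[str, int, int, str]:
    """Sort by docset id, then multi-document summaries (50, 100, 200, 400),
    then single-document summaries by document id."""
    if s.kind == "multi":
        return (s.docset_id, 0, s.size, "")
    return (s.docset_id, 1, 0, s.doc_id or "")


def order_submission(summaries: List[Summary]) -> List[Summary]:
    """Return the summaries in the order described for a submission."""
    return sorted(summaries, key=submission_key)
```

Ids are compared as plain strings here, so this assumes that lexicographic order matches the intended "ascending" order of docset and document ids.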
Human evaluation was done at NIST using the same personnel who
created the reference data. These people did pairwise comparisons
of the reference summaries to the system-generated summaries, other
reference summaries, and baseline summaries.
NIST used a modified version of the SEE software from ISI to
support the human evaluation protocol. An unmodified version of
SEE is available now.
NIST returned to each participating group the raw
comparison data from SEE for their submitted summaries by
1 August. NIST also made all submitted summaries available
to all participating groups around mid-July.
At the workshop NIST presented an overview of the results.
Because this was the first use of the DUC 2001 evaluation procedure,
the emphasis was on low-level diagnostic information rather
than on a single score per system. NIST also provided data on
human performance such as consistency of judgments and upper bounds
of performance. This included doing multiple judgments on some subset
of the results, judging reference summaries against other
human-generated summaries, and in general investigating
sources of variability.