Document Understanding Conferences
DUC 2002 Guidelines
To further progress in summarization and enable researchers
to participate in large-scale experiments, the National Institute of
Standards and Technology (NIST)
continued an evaluation in the area of text summarization called the
Document Understanding Conference (DUC). DUC is part of a
Defense Advanced Research Projects Agency (DARPA) program,
Translingual Information Detection, Extraction, and Summarization
(TIDES),
which specifically calls for major advances in summarization
technology, both in English and from other languages to English
(cross-language summarization). The basic design for the evaluation
follows ideas in a summarization road map that was created by a
committee of researchers in summarization, headed by Daniel Marcu. It
also profited from the experiences of DUC 2001.
Plans called for the creation of reference data (documents and
summaries) for testing. DUC 2001
data was available as training data for DUC 2002, once the short
application had been submitted (see "Call for participation" below)
and the required permission forms were signed. No additional training
data was created due to the shortened schedule. The test data for
DUC 2002 was distributed at the end of March, and results were
due for evaluation in mid-April. The DUC 2002 workshop was held
as part of the ACL-2002 Automatic Summarization Workshop, July 11-12, 2002,
in Philadelphia, to discuss these results and to make plans for further
evaluations.
- Call for participation in DUC 2002
- This was the invitation to participate and the instructions on how to apply.
- Information for active participants
-
This information was for individuals and groups who responded to
the call for participation (see above), applied to participate in
DUC 2002, and received the active participants'
userid/password.
Included were:
- Schedule
26 Nov    | NIST sent out a call for participation in DUC 2002
28 Feb    | Guidelines and evaluation plans completed
29 Mar    | NIST sent out test document sets
12 Apr    | Extended abstracts were due at NIST (if you wanted to speak on day 2 of the workshop)
15 Apr    | Results submitted to NIST by midnight NIST time
7 Jun     | NIST sent evaluated results to participants
23 Jun    | Notebook papers for the DUC 2002 conference were due
11-12 Jul | DUC 2002 workshop, held as part of the ACL-02 Automatic Summarization Workshop in Philadelphia
- Data
- NIST produced 60 reference sets.
Each set contained documents, single-document abstracts, and
multi-document abstracts/extracts, with sets defined by different
types of criteria such as event sets, biographical sets, etc. The
documents were available with and without "sentences" tagged
as defined by a version of the
simple sentence separation software used for DUC 2001.
Examples of a document with and without the sentence tagging were
provided (a rough sketch of this kind of sentence splitting follows
this item). NOTE: This means the test data was in a slightly different
format from the training data. The test data was password-protected;
the password was provided by NIST to participants who had the required
forms (see above) on file at NIST.
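The actual NIST sentence-separation software and its tag format are distributed with the DUC data and are not reproduced here. As a rough sketch of what such "simple" sentence separation involves, the snippet below splits on sentence-final punctuation and wraps each piece in a tag; both the splitting rule and the <s> tag name are illustrative assumptions, not the DUC conventions.

    import re

    # Illustrative only: a naive splitter in the spirit of "simple sentence
    # separation" -- not the actual NIST tool, whose rules and tag format are
    # defined by the software distributed with the DUC data.
    _SENT_END = re.compile(r'(?<=[.!?])\s+')

    def split_sentences(text):
        """Split text into rough sentences at ., !, or ? followed by whitespace."""
        return [s.strip() for s in _SENT_END.split(text) if s.strip()]

    def tag_sentences(text):
        """Wrap each rough sentence in an <s>...</s> tag (tag name is illustrative)."""
        return "\n".join("<s> %s </s>" % s for s in split_sentences(text))

    sample = "The quake struck at dawn. Rescue teams arrived within hours. Was anyone hurt?"
    print(tag_sentences(sample))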
- Tasks
-
There were three tasks defined as follows.
- Fully automatic summarization of a single newswire/newspaper
document (article):
-
Sixty sets of approximately 10 documents each were provided as
system input for this task. Once the test data was received, no
manual modification or augmentation of the test data or the test
system was allowed.
- A generic abstract of the document, approximately 100 words
(whitespace-delimited tokens) or less in length, was created.
The coverage metric took length into account and rewarded conciseness.
The abstracts were composed entirely of complete sentences.
- Fully automatic summarization of multiple newswire/newspaper
documents (articles) on a single subject:
-
Sixty document sets of approximately 10 documents each were
provided as system input for this task. Once the test data was
received, no manual modification or augmentation of the test data
or the test system was allowed.
- Four generic abstracts of
the entire set with lengths of approximately 200, 100, 50, and 10
words (whitespace-delimited tokens) or less were created. The coverage metric took length
into account and rewarded conciseness. The 200-, 100-, and 50-word
abstracts were composed entirely of complete sentences. The
10-word abstract took the form of a headline.
- Given a set of such documents, two generic sentence
extracts of the entire set, with lengths of approximately 400 and
200 words (whitespace-delimited tokens) or less, were created. Each such extract
consisted of some subset of the "sentences" predefined by
NIST in the sentence-separated document set. Each predefined
sentence was used in its entirety or not at all in constructing
an extract. NIST calculated at least sentence recall (a sketch of
this computation follows this list). Participants
may have been interested in using the MEAD evaluation
software developed as part of a summer
workshop on summarization of multiple (multilingual)
documents at Johns Hopkins University in 2001.
- One or two pilot projects with extrinsic evaluation
- Details concerning the pilot tasks and
their evaluation were still to be determined.
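Because an extract is just a subset of the predefined, NIST-numbered sentences, sentence recall can be illustrated by comparing the sentence identifiers chosen by a system with those in a reference extract. The sketch below is only an illustration of that idea, not the official scoring procedure; the (docid, sentence-number) identifier scheme and the example identifiers are assumptions.

    def sentence_recall(system_sents, reference_sents):
        """Fraction of reference sentences that also appear in the system extract.

        Both arguments are collections of sentence identifiers, e.g.
        (docid, sentence_number) tuples; this identifier scheme is an
        assumption for illustration, not NIST's official one.
        """
        reference = set(reference_sents)
        if not reference:
            return 0.0
        return len(set(system_sents) & reference) / len(reference)

    # Hypothetical identifiers, for illustration only.
    reference = {("DOC-A", 1), ("DOC-A", 4), ("DOC-B", 2)}
    system = {("DOC-A", 1), ("DOC-B", 2), ("DOC-B", 7)}
    print(sentence_recall(system, reference))  # 2 of 3 reference sentences matched -> about 0.67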
- Submissions and Evaluation
-
Abstracts were manually evaluated by NIST. (Participants who
wanted to explore automatic evaluation of abstracts were encouraged
to do so.)
Extracts were automatically evaluated employing techniques
noted in the road map. NIST calculated at least sentence recall.
NIST defined a standardized format for submissions
from summarization systems; an SGML DTD for the format of the
abstracts/extracts and a small example were provided.
The entire set of summaries generated by a given system
was submitted as simple ASCII text in the body of one email
to Lori.Buckland@nist.gov, using the format defined above.
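Since all of the length limits above are stated in whitespace-delimited tokens, a participant might run a simple length check over each summary before emailing a submission. The sketch below is a minimal illustration of such a check under that token-counting assumption; it is not part of any official NIST tooling.

    def token_count(summary):
        """Length of a summary in whitespace-delimited tokens."""
        return len(summary.split())

    def within_limit(summary, target_size):
        """True if the summary is at or under the target size in tokens.

        The task definitions say "approximately N words or less", so the
        strict <= comparison here is an assumption for illustration.
        """
        return token_count(summary) <= target_size

    summary = "A short example summary consisting of exactly ten whitespace-delimited tokens."
    for size in (10, 50, 100, 200):
        print(size, within_limit(summary, size))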
These were the counts of abstracts and extracts
we received for evaluation. The (in some cases abbreviated)
system ids are shown in the first column.
System id    |  Abstracts                         |  Extracts
             |  Single  |  Multi                  |
             |     100  |   10    50   100   200  |   200   400
-------------+----------+-------------------------+--------------
bbn.head1n   |     567  |    0     0     0     0  |     0     0
ccsnsa.v2    |     567  |    0    59    59    59  |    59    59
gleans.v1    |     567  |   59    59    59    59  |    59    59
imp_col      |     567  |    0     0     0     0  |     0     0
kul.2002     |     567  |   59    59    59    59  |    59    59
lcc.duc02    |     566  |   59    59    59    59  |    59    59
lion_sum     |       0  |    0    59    59    59  |    59    59
MICHIGAN     |     567  |   59    59    59    59  |    59    59
MSRC         |     559  |    0     0     0     0  |     0     0
ntt.duc02    |     567  |    0     0     0     0  |     0     0
SumUMFAR     |     565  |    0     0     0     0  |     0     0
tno-duc02    |       0  |   59    59    59    59  |    59    59
ULeth131m    |     567  |    0     0     0     0  |    59    59
unicorp.v36  |       0  |    0     0     0     0  |    59    59
uottawa      |     567  |    0     0     0     0  |     0     0
webc12002    |       0  |   59    59    59    59  |     0     0
wpdv-xtr.v1  |     567  |    0     0     0     0  |    59    59
-------------+----------+-------------------------+--------------
Total        |    7359  |  354   472   472   472  |   590   590

Total abstracts (single + multi): 9129          Total extracts: 1180
Human evaluation was done at NIST using the same personnel who
created the reference data. These assessors performed pairwise
comparisons of the reference abstracts against the system-generated
abstracts, other reference abstracts, and baseline abstracts.
Metrics included a measure of the extent to which the
automatically created summaries expressed the meaning in the
reference summaries (coverage). The length of the automatically
created abstracts was taken into account in calculating the
coverage measure and conciseness was
rewarded. Grammaticality, coherence, and organization of
abstracts were measured wherever appropriate.
NIST used a modified version of the SEE software
from ISI to support the
human evaluation protocol.
An unmodified version of SEE
is available.
- Results for participants
-
The evaluation results for DUC 2002 participants are available here.
A password is required for access to past results;
contact Lori Buckland (lori.buckland@nist.gov) as needed.