Document Understanding Conferences
DUC 2002 Guidelines
To further progress in summarization and enable researchers
to participate in large-scale experiments, the National Institute of
Standards and Technology (NIST)
continued an evaluation in the area of text summarization called the
Document Understanding Conference (DUC). DUC is part of a
Defense Advanced Research Projects Agency (DARPA) program,
Translingual Information Detection, Extraction, and Summarization
(TIDES),
which specifically calls for major advances in summarization
technology, both in English and from other languages to English
(cross-language summarization). The basic design for the evaluation
follows ideas in a summarization road map that was created by a
committee of researchers in summarization, headed by Daniel Marcu. It
also profited from the experiences of DUC 2001.
Plans called for the creation of reference data (documents and
summaries) for testing. DUC 2001
data was available as training data for DUC 2002, once the short
application had been submitted (see "Call for participation" below)
and the required permission forms were signed. No additional training
data was created due to the shortened schedule. The test data for
DUC 2002 was distributed at the end of March, and results were
due for evaluation in mid-April. The DUC 2002 workshop was held
as part of the ACL-2002 Automatic Summarization Workshop, July 11-12, 2002,
in Philadelphia, to discuss these results and to make plans for further
evaluations.
- Call for participation in DUC 2002
- This was the invitation to participate and the instructions on how to apply.
- Information for active participants
-
This information was for individuals and groups who responded to
the call for participation (see above), applied to participate in
DUC 2002, and received the active participants'
userid/password.
Included were:
- Schedule
26 Nov    | NIST sent out a call for participation in DUC 2002
28 Feb    | Guidelines and evaluation plans completed
29 Mar    | NIST sent out test document sets
12 Apr    | Extended abstracts were due at NIST (if you wanted to speak on day 2 of the workshop)
15 Apr    | Results submitted to NIST by midnight NIST time
7 Jun     | NIST sent evaluated results to participants
23 Jun    | Notebook papers for the DUC 2002 conference were due
11-12 Jul | DUC 2002 workshop, held as part of the ACL-02 Automatic Summarization Workshop in Philadelphia
- Data
- NIST produced 60 reference sets.
Each set contained documents, single-document abstracts, and
multi-document abstracts/extracts, with sets defined by different
types of criteria such as event sets, biographical sets, etc. The
documents were available with and without "sentences" tagged
as defined by a version of the
simple sentence separation software used for DUC 2001.
Examples of a document with and without the sentence tagging were
provided (a rough sketch of this kind of sentence splitting follows
this item). NOTE: This means the test data was in a slightly different
format from the training data. The test data was password-protected;
the password was provided by NIST to participants who had the required
forms (see above) on file at NIST.
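The actual NIST sentence-separation software and its tag format are distributed with the DUC data and are not reproduced here. As a rough sketch of what such "simple" sentence separation involves, the snippet below splits on sentence-final punctuation and wraps each piece in a tag; both the splitting rule and the <s> tag name are illustrative assumptions, not the DUC conventions.

    import re

    # Illustrative only: a naive splitter in the spirit of "simple sentence
    # separation" -- not the actual NIST tool, whose rules and tag format are
    # defined by the software distributed with the DUC data.
    _SENT_END = re.compile(r'(?<=[.!?])\s+')

    def split_sentences(text):
        """Split text into rough sentences at ., !, or ? followed by whitespace."""
        return [s.strip() for s in _SENT_END.split(text) if s.strip()]

    def tag_sentences(text):
        """Wrap each rough sentence in an <s>...</s> tag (tag name is illustrative)."""
        return "\n".join("<s> %s </s>" % s for s in split_sentences(text))

    sample = "The quake struck at dawn. Rescue teams arrived within hours. Was anyone hurt?"
    print(tag_sentences(sample))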
- Tasks
-
There were three tasks defined as follows.
- Fully automatic summarization of a single newswire/newspaper
document (article):
-
Sixty sets of approximately 10 documents each were provided as
system input for this task. Once the test data was received, no
manual modification or augmentation of the test data or the test
system was allowed.
- A generic abstract of the document, approximately 100 words
(whitespace-delimited tokens) or less in length, was created.
The coverage metric took length into account and rewarded conciseness.
The abstracts were composed entirely of complete sentences.
- Fully automatic summarization of multiple newswire/newspaper
documents (articles) on a single subject:
-
Sixty document sets of approximately 10 documents each were
provided as system input for this task. Once the test data was
received, no manual modification or augmentation of the test data
or the test system was allowed.
- Four generic abstracts of
the entire set with lengths of approximately 200, 100, 50, and 10
words (whitespace-delimited tokens) or less were created. The coverage metric took length
into account and rewarded conciseness. The 200-, 100-, and 50-word
abstracts were composed entirely of complete sentences. The
10-word abstract took the form of a headline.
- Given a set of such documents, two generic sentence
extracts of the entire set, with lengths of approximately 400 and
200 words (whitespace-delimited tokens) or less, were created. Each such extract
consisted of some subset of the "sentences" predefined by
NIST in the sentence-separated document set. Each predefined
sentence was used in its entirety or not at all in constructing
an extract. NIST calculated at least sentence recall (a sketch of
this computation follows this list). Participants
may have been interested in using the MEAD evaluation
software developed as part of a summer
workshop on summarization of multiple (multilingual)
documents at Johns Hopkins University in 2001.
- One or two pilot projects with extrinsic evaluation
- Details concerning the pilot tasks and
their evaluation were still to be determined.
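Because an extract is just a subset of the predefined, NIST-numbered sentences, sentence recall can be illustrated by comparing the sentence identifiers chosen by a system with those in a reference extract. The sketch below is only an illustration of that idea, not the official scoring procedure; the (docid, sentence-number) identifier scheme and the example identifiers are assumptions.

    def sentence_recall(system_sents, reference_sents):
        """Fraction of reference sentences that also appear in the system extract.

        Both arguments are collections of sentence identifiers, e.g.
        (docid, sentence_number) tuples; this identifier scheme is an
        assumption for illustration, not NIST's official one.
        """
        reference = set(reference_sents)
        if not reference:
            return 0.0
        return len(set(system_sents) & reference) / len(reference)

    # Hypothetical identifiers, for illustration only.
    reference = {("DOC-A", 1), ("DOC-A", 4), ("DOC-B", 2)}
    system = {("DOC-A", 1), ("DOC-B", 2), ("DOC-B", 7)}
    print(sentence_recall(system, reference))  # 2 of 3 reference sentences matched -> about 0.67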
- Submissions and Evaluation
-
Abstracts were manually evaluated by NIST. (Participants who
wanted to explore automatic evaluation of abstracts were encouraged
to do so.)
Extracts were automatically evaluated employing techniques
noted in the road map. NIST calculated at least sentence recall.
NIST defined a standardized format for submissions
from summarization systems; an SGML DTD for the format of the
abstracts/extracts and a small example were provided.
The entire set of summaries generated by a given system
was submitted as simple ASCII text in the body of one email
to Lori.Buckland@nist.gov, using the format defined above.
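Since all of the length limits above are stated in whitespace-delimited tokens, a participant might run a simple length check over each summary before emailing a submission. The sketch below is a minimal illustration of such a check under that token-counting assumption; it is not part of any official NIST tooling.

    def token_count(summary):
        """Length of a summary in whitespace-delimited tokens."""
        return len(summary.split())

    def within_limit(summary, target_size):
        """True if the summary is at or under the target size in tokens.

        The task definitions say "approximately N words or less", so the
        strict <= comparison here is an assumption for illustration.
        """
        return token_count(summary) <= target_size

    summary = "A short example summary consisting of exactly ten whitespace-delimited tokens."
    for size in (10, 50, 100, 200):
        print(size, within_limit(summary, size))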
These were the counts of abstracts and extracts
we received for evaluation. The (in some cases abbreviated)
system ids are shown in the first column.
System id    |  Abstracts                         |  Extracts
             |  Single  |  Multi                  |
             |     100  |   10    50   100   200  |   200   400
-------------+----------+-------------------------+--------------
bbn.head1n   |     567  |    0     0     0     0  |     0     0
ccsnsa.v2    |     567  |    0    59    59    59  |    59    59
gleans.v1    |     567  |   59    59    59    59  |    59    59
imp_col      |     567  |    0     0     0     0  |     0     0
kul.2002     |     567  |   59    59    59    59  |    59    59
lcc.duc02    |     566  |   59    59    59    59  |    59    59
lion_sum     |       0  |    0    59    59    59  |    59    59
MICHIGAN     |     567  |   59    59    59    59  |    59    59
MSRC         |     559  |    0     0     0     0  |     0     0
ntt.duc02    |     567  |    0     0     0     0  |     0     0
SumUMFAR     |     565  |    0     0     0     0  |     0     0
tno-duc02    |       0  |   59    59    59    59  |    59    59
ULeth131m    |     567  |    0     0     0     0  |    59    59
unicorp.v36  |       0  |    0     0     0     0  |    59    59
uottawa      |     567  |    0     0     0     0  |     0     0
webc12002    |       0  |   59    59    59    59  |     0     0
wpdv-xtr.v1  |     567  |    0     0     0     0  |    59    59
-------------+----------+-------------------------+--------------
Total        |    7359  |  354   472   472   472  |   590   590

Total abstracts (single + multi): 9129          Total extracts: 1180
Human evaluation was done at NIST using the same personnel who
created the reference data. These assessors performed pairwise
comparisons of the reference abstracts against the system-generated
abstracts, other reference abstracts, and baseline abstracts.
Metrics included a measure of the extent to which the
automatically created summaries expressed the meaning in the
reference summaries (coverage). The length of the automatically
created abstracts was taken into account in calculating the
coverage measure and conciseness was
rewarded. Grammaticality, coherence, and organization of
abstracts were measured wherever appropriate.
NIST used a modified version of the SEE software
from ISI to support the
human evaluation protocol.
An unmodified version of SEE
is available.
- Results for participants
-
The evaluation results for DUC 2002 participants are available here.
A password is required for access to past results;
contact Lori Buckland (lori.buckland@nist.gov) as needed.