DUC 2005: Task, Documents, and Measures

What's new for DUC 2005

DUC 2005 marks a major change in direction from last year. The road mapping committee strongly recommended undertaking new tasks that are closely tied to a clear user application. The report writing task discussed at the May meeting clearly met that requirement, but since then the program committee has seriously discussed working on new evaluation methodologies and metrics that would better address the human variation issues discussed in Barcelona. It was therefore decided that the main thrust of DUC 2005 would be a simpler (but still user-oriented) task that allows the whole community to put some of its time and effort into helping with this new evaluation framework during 2005. The same task will likely continue in DUC 2006 (expected in the late spring of that year), which will focus again on system performance within the improved framework.

The system task in 2005 will be to synthesize from a set of 25-50 documents a brief, well-organized, fluent answer to a need for information that cannot be met by just stating a name, date, quantity, etc. This task will model real-world complex question answering and was suggested by "An Empirical Study of Information Synthesis Tasks" (Enrique Amigo, Julio Gonzalo, Victor Peinado, Anselmo Penas, Felisa Verdejo; {enrique,julio,victor,anselmo,felisa}@lsi.uned.es).

The main goals in DUC 2005 and their associated actions are listed below.

1) Inclusion of user/task context information for systems and human summarizers

create DUC topics which explicitly reflect the specific interests of a potential user in a task context
capture some general user/task preferences in a simple user profile

2) Evaluation of content in terms of more basic units of meaning

develop automatic tools and/or manual procedures to identify basic units of meaning
develop automatic tools and/or manual procedures to estimate the importance of such units based on agreement among humans
use the above in evaluating systems
evaluate the new evaluation scheme(s)

3) Better understanding of normal human variability in a summarization task and how it may affect evaluation of summarization systems

create as many manual reference summaries as feasible
examine the relationship between the number of reference summaries, the ways in which they vary, and the effect of the number and variability on system evaluation (absolute scoring, relative system differences, reliability, etc.)
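
As an illustration of goal 2, the idea of estimating a unit's importance from agreement among humans can be sketched as follows. This is only a minimal sketch under stated assumptions: each reference summary is assumed to have already been reduced to a set of unit labels (plain strings here), a unit's weight is assumed to be the number of reference summaries containing it, and the function names `unit_weights` and `weighted_coverage` are hypothetical. It is not the procedure adopted for DUC 2005.

```python
from collections import Counter


def unit_weights(reference_units):
    """Weight each unit by the number of reference summaries that contain it.

    `reference_units` is an iterable of collections of unit labels,
    one collection per reference summary (an assumed representation).
    """
    weights = Counter()
    for units in reference_units:
        weights.update(set(units))  # count each unit once per summary
    return weights


def weighted_coverage(peer_units, weights):
    """Fraction of the total reference weight covered by the peer's units."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[u] for u in set(peer_units)) / total


# Hypothetical unit labels for three reference summaries and one peer.
references = [{"x", "y"}, {"x", "z"}, {"x", "y"}]
weights = unit_weights(references)
coverage = weighted_coverage({"x", "q"}, weights)
```

A unit that every human included receives the highest weight, so a system is rewarded most for covering content on which the humans agree.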

Documents for summarization

NIST assessors will be allowed to choose TREC topics of interest to them. Each of these topics will have at least 35 relevant documents associated with it. The assessor will read the documents for a topic, verify the relevance of each, look for aspects of the topic of particular interest, create a DUC topic reflecting that particular interest, and choose a subset of 25-50 documents relevant to the DUC topic. These documents will be the DUC test document cluster for that topic. The assessor will also specify the desired granularity of the summary ("general" or "specific") in a user profile. Here are the instructions given to the assessors for creating the topics.

The documents will come from the following collections, each with its own tagging (see DTD):

Financial Times of London (DTD)
Los Angeles Times (DTD)

Test documents will be distributed by NIST via the Internet.

Reference summaries

The NIST assessor who developed the DUC topic will create a ~250-word summary of the cluster that meets the need expressed in the topic. The summary will be written at a level of granularity consistent with the granularity requested in the user profile. For each topic, other NIST assessors will also be given the user profile, DUC topic, and document cluster and will be asked to create a summary that meets the needs expressed in the topic and user profile. These multiple reference summaries will be used in the evaluation. It is our intention, if funding can be secured, to create a total of 4 reference summaries for each of 30 of the topics and 10 reference summaries for each of 20 of the topics. Here are the instructions given to the assessors for writing the summaries. Here are example summaries for two topics.

System task

System task: Given a user profile, a DUC topic, and a cluster of documents relevant to the DUC topic, create from the documents a brief, well-organized, fluent summary which answers the need for information expressed in the topic, at the level of granularity specified in the user profile.

The summary can be no longer than 250 words (whitespace-delimited tokens). Summaries over the size limit will be truncated. No bonus will be given for creating a shorter summary. No specific formatting other than linear is allowed.
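
For participants who want to check the size limit before submitting, the truncation rule can be approximated as below. This is a minimal sketch, assuming that "word" means exactly a whitespace-delimited token as stated above; the function name `truncate_summary` is hypothetical, and joining the kept tokens with single spaces is a simplification, not NIST's actual truncation script.

```python
def truncate_summary(text: str, limit: int = 250) -> str:
    """Keep at most `limit` whitespace-delimited tokens of `text`.

    Runs of whitespace (including newlines) are collapsed to single
    spaces in the result, which is a simplification of this sketch.
    """
    tokens = text.split()
    return " ".join(tokens[:limit])


# A 300-token toy summary is cut down to its first 250 tokens.
long_summary = " ".join(f"w{i}" for i in range(300))
truncated = truncate_summary(long_summary)
```

Because no bonus is given for shorter summaries, a summary that is already within the limit passes through with only its whitespace normalized.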

The summary should include (in some form or other) all the information in the documents that contributes to meeting the information need. Some generalization may be required to fit everything in.

Each group can submit one set of results, i.e., one summary for each topic/cluster. Participating groups should be able to evaluate additional results themselves using automatic evaluation tools developed by ISI.

Evaluation

NIST

NIST's role in the evaluation will be limited since it was considered more important in 2005 for NIST to apply its resources to creating more reference summaries than have been available in the past.

NIST will manually evaluate the linguistic well-formedness of each submitted summary using a set of quality questions.
NIST will manually evaluate the relative responsiveness of each submitted summary to the topic. Here are instructions to the assessors for judging responsiveness.

NIST will run ROUGE-1.5.5 with the following parameters:

ROUGE-1.5.5.pl -n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -d

-n 2	compute ROUGE-1 and ROUGE-2
-x	do not calculate ROUGE-L
-m	apply Porter stemmer on both models and peers
-2 4	compute Skip Bigram (ROUGE-S) with a maximum skip distance of 4
-u	include unigram in Skip Bigram (ROUGE-S)
-c 95	use 95% confidence interval
-r 1000	bootstrap resample 1000 times to estimate the 95% confidence interval
-f A	scores are averaged over multiple models
-p 0.5	compute F-measure with alpha = 0.5
-t 0	use model unit as the counting unit
-d	print per-evaluation scores
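
To make the units behind `-n 2` and `-2 4 -u` concrete, here is a simplified sketch of bigram recall and of skip-bigram-plus-unigram recall against a single model summary. It is an illustration only, not the ROUGE-1.5.5 implementation: it assumes pre-tokenized input, omits the Porter stemming requested by `-m`, handles one model at a time, and uses hypothetical function names.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs (the unit behind ROUGE-2)."""
    return Counter(zip(tokens, tokens[1:]))


def skip_bigrams_with_unigrams(tokens, max_skip=4):
    """Count in-order token pairs separated by at most `max_skip` tokens,
    plus single tokens (the units behind ROUGE-SU4 when max_skip is 4)."""
    units = Counter((t,) for t in tokens)
    for i, left in enumerate(tokens):
        # j - i - 1 is the number of skipped tokens between the pair.
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            units[(left, tokens[j])] += 1
    return units


def recall(peer_units, model_units):
    """Clipped overlap divided by the number of units in the model."""
    total = sum(model_units.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, peer_units[unit])
                  for unit, count in model_units.items())
    return overlap / total


model = "the cat sat on the mat".split()
peer = "the cat lay on the mat".split()
rouge2_recall = recall(bigrams(peer), bigrams(model))
su4_recall = recall(skip_bigrams_with_unigrams(peer),
                    skip_bigrams_with_unigrams(model))
```

Dividing by the model's unit count is what makes these recall scores; precision would divide the same overlap by the peer's unit count instead.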

Jackknifing will be implemented so that human and system scores can be compared. This set of parameters will compute a number of ROUGE scores, but only the recall scores of ROUGE-2 and ROUGE-SU4 will be used as the official ROUGE scores. Participants are encouraged to calculate and discuss other metrics as well.
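
The jackknifing idea can be sketched as follows, under the assumption that it follows the usual leave-one-out scheme: with K reference summaries, each human summary is scored against the other K-1, and each system summary is scored against every subset of K-1 references, with the results averaged. The function names are hypothetical, and `unigram_recall` is only a toy stand-in for a real ROUGE score; the exact procedure NIST runs is not specified here.

```python
from collections import Counter
from statistics import mean


def unigram_recall(peer: str, model: str) -> float:
    """Toy stand-in for a ROUGE score: clipped unigram recall."""
    peer_counts = Counter(peer.split())
    model_counts = Counter(model.split())
    total = sum(model_counts.values())
    if total == 0:
        return 0.0
    return sum(min(c, peer_counts[t]) for t, c in model_counts.items()) / total


def score_against(peer, models, score=unigram_recall):
    """Average the score over several models (the role played by -f A)."""
    return mean(score(peer, m) for m in models)


def jackknife_system(peer, models, score=unigram_recall):
    """Average the system's score over every leave-one-out model subset."""
    models = list(models)
    if len(models) < 2:
        raise ValueError("jackknifing needs at least two reference summaries")
    return mean(score_against(peer, models[:i] + models[i + 1:], score)
                for i in range(len(models)))


def jackknife_human(index, models, score=unigram_recall):
    """Score one reference summary against the remaining K-1 references."""
    models = list(models)
    if len(models) < 2:
        raise ValueError("jackknifing needs at least two reference summaries")
    others = models[:index] + models[index + 1:]
    return score_against(models[index], others, score)


references = ["a b c", "a b d", "a e f"]  # toy reference summaries
system_score = jackknife_system("a b", references)
human_score = jackknife_human(0, references)
```

In this sketch, because scores are averaged across models, the system's jackknifed score equals its plain average over all K references. The leave-one-out step matters mainly because it puts each human summary, which cannot be scored against itself, on the same K-1 footing as the systems.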

All summaries will first be truncated to 250 words. Where sentences need to be identified for automatic evaluation, NIST will then use a simple Perl script for sentence segmentation.

ISI, Columbia, and others

The main evaluation of how well each submitted summary agrees in content with the manually created reference summaries will be carried out cooperatively by (hopefully) most of the participating groups under the leadership of ISI/USC and Columbia University. This evaluation will explore both automatic and manual approaches.

Tools for DUC 2005

ISI's webpage on Basic Elements. The download also includes ROUGE version 1.5.5.
Columbia's webpage on Pyramids

For data, past results, mailing list or other general information
contact: Lori Buckland ([email protected])
For other questions contact: Hoa Dang (hoa.dang AT nist.gov)
Date created: Wednesday, 24-November-05