Document Understanding Conferences


DUC 2002: Length-Adjusted Coverage

DUC 2001 measured coverage - how much of a model summary's content was expressed by a system-generated peer summary. For DUC 2002 there was a desire also to:

look at the ability of a system to produce a summary shorter than the predefined target length
devise a combined measure of coverage and compression.

The following change was made to SEE:

The coverage question was changed so that the user answers the same question as in DUC 2001, but with percentages as answers. We want to treat this openly as a ratio scale, that is, an interval scale with a real zero.

     The marked PUs, taken together, express roughly (choose one)
        0%   20%   40%   60%   80%   100%
     of the meaning expressed by the current model unit.

This can be a ratio scale. Even in 2001 there was evidence that the assessors associated percentages with the various choices (All, Most, Some, ...), though probably not at equal intervals.
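
As a rough illustration of treating the answers as a ratio scale, the sketch below converts each percentage choice to a fraction in 0..1. The function names are hypothetical, and averaging the per-unit fractions over all model units is an assumption made here for illustration; this page does not state how per-unit answers are combined.

```python
from statistics import mean

# The six answer choices offered to the assessor, as percentages.
CHOICES = (0, 20, 40, 60, 80, 100)


def unit_coverage(percent: int) -> float:
    """Convert one assessor answer for one model unit to a fraction in 0..1."""
    if percent not in CHOICES:
        raise ValueError(f"not an offered choice: {percent}")
    return percent / 100


def summary_coverage(percents: list[int]) -> float:
    """Combine per-unit answers into one coverage value.

    ASSUMPTION: a plain average over model units; the aggregation rule
    is not specified in the text above.
    """
    return mean(unit_coverage(p) for p in percents)
```

Because the scale has a real zero, statements such as "80% is twice 40%" are meaningful, which is what makes averaging the fractions defensible in this sketch.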

Defining coverage (c) ranging from 0..1

   c = fraction of the model's meaning expressed by the peer (see the coverage question above)

Creating a measure of brevity (b) ranging from 0..1

     if actual peer length > predefined target length then
        b = 0
     else
        b = (predefined target length - actual peer length) / predefined target length

Length is measured in words (the number of whitespace-delimited strings).
Handling (demoting) summaries longer than the target complicates the scaling and, more importantly, allows systems to try to boost their composite score by losing on brevity but perhaps gaining in coverage. We're not interested in encouraging this in DUC 2002, so we don't suggest enabling it.
We've chosen to assume that groups will be near the target length or below it. If summaries significantly longer than the target are submitted, our proposal would be to truncate them.
We measure against the predefined target lengths (200, 100, 50, 10) rather than abolishing them and using the composite measure to compare all summaries to the largest (200), because the 4 target sizes are just convenient ways of designating 4 significantly different sorts of summaries, different enough that comparing one to another would not make sense.
The measure rewards equal compression ratios equally: summarizing a 50-word target in 25 words is as valuable as summarizing a 400-word target in 200 words. One could alternatively argue for a measure that rewards brevity proportional to the absolute number of words saved, but again, we are treating the different target sizes as defining different categories of summaries, so cross-category comparison will not be appropriate anyway.
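
The brevity definition and the notes above can be sketched as follows. This is a minimal illustration, not NIST's evaluation code: the function names are hypothetical, and truncating to the first N words is only one assumed reading of the truncation proposal.

```python
def word_count(text: str) -> int:
    """Length in words: the number of whitespace-delimited strings."""
    return len(text.split())


def brevity(peer_length: int, target_length: int) -> float:
    """Brevity b in 0..1: 0 if the peer exceeds the target,
    otherwise the fraction of the target length that was saved."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if peer_length > target_length:
        return 0.0
    return (target_length - peer_length) / target_length


def truncate(text: str, target_length: int) -> str:
    """ASSUMPTION: keep only the first target_length words.

    Words are rejoined with single spaces, so the original
    whitespace is not preserved.
    """
    return " ".join(text.split()[:target_length])
```

With these definitions, `brevity(25, 50)` and `brevity(200, 400)` both give 0.5, matching the statement that equal compression ratios are rewarded equally, and a summary exactly at the target length gets b = 0.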

Letting the new composite measure X (yet to be named) be a weighted arithmetic mean of coverage and brevity, ranging from 0..1:

        X = a*c + (1-a)*b  where a ranges from 0..1 and controls the
                           relative importance of coverage and brevity
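
The composite measure can be written as a one-line function. The sketch below is illustrative only; the function name `composite` and the range checks are additions made here, not part of the definition.

```python
def composite(c: float, b: float, a: float) -> float:
    """Weighted arithmetic mean X = a*c + (1-a)*b.

    c: coverage in 0..1, b: brevity in 0..1,
    a: weight in 0..1 given to coverage.
    """
    for name, value in (("c", c), ("b", b), ("a", a)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in 0..1, got {value}")
    return a * c + (1 - a) * b
```

For example, with coverage 0.6 and brevity 0.5, an equal weighting (a = 0.5) gives approximately 0.55. At the extremes, a = 1 reduces X to coverage alone and a = 0 reduces it to brevity alone.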

Weighting is needed because different applications/users will likely prefer different tradeoffs between coverage and brevity.
The use of the geometric and harmonic means is not advisable because they are relatively insensitive in the range of small values we expect for coverage and/or brevity (including 0).
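
One way to read the remark about insensitivity is to compare the three means when brevity is 0, as it is for any summary at or above the target length. The sketch below uses unweighted means for simplicity (an assumption; the unweighted arithmetic mean corresponds to a = 0.5), and the harmonic mean is taken to be 0 when either input is 0.

```python
from math import sqrt


def arithmetic_mean(c: float, b: float) -> float:
    return (c + b) / 2


def geometric_mean(c: float, b: float) -> float:
    return sqrt(c * b)


def harmonic_mean(c: float, b: float) -> float:
    # Taken as 0 when either input is 0 (the limiting value),
    # which also avoids a division by zero.
    if c == 0 or b == 0:
        return 0.0
    return 2 * c * b / (c + b)


# Brevity fixed at 0, coverage varying.
for c in (0.2, 0.4, 0.6):
    b = 0.0
    print(
        f"c={c:.1f} b={b:.1f} "
        f"arithmetic={arithmetic_mean(c, b):.2f} "
        f"geometric={geometric_mean(c, b):.2f} "
        f"harmonic={harmonic_mean(c, b):.2f}"
    )
```

With b = 0 the geometric and harmonic means are 0 whatever the coverage, so they cannot distinguish a summary with coverage 0.6 from one with coverage 0.2. The arithmetic mean still increases with coverage.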

For reporting purposes NIST will use two settings of a:

        a = 1   (brevity doesn't matter, continuity with DUC 2001)
        a = 2/3 (coverage is twice as important as brevity)

Some groups may not want to work on creating summaries shorter than the targets and should not be penalized. For them a = 1 makes sense.

For data, past results, mailing list or other general information
contact: Lori Buckland ([email protected])
For other questions contact: Paul Over ([email protected])
Last updated:
Date created: Friday, 26-July-02