Document Understanding Conferences
DUC 2002: Length-Adjusted Coverage
DUC 2001 measured coverage - how much of a model summary's content
was expressed by a system-generated peer summary. For DUC 2002 there was
a desire also to:
-
look at the ability of a system to produce a summary shorter than the predefined
target length
-
devise a combined measure of coverage and compression.
The following change was made to SEE:
-
The coverage question was changed so that the user answers the
same question as in DUC 2001 but using percentages as answers. We want
to treat this openly as a ratio scale, that is, an interval scale with a real zero.
The marked PUs, taken together, express roughly (choose one)
0% 20% 40% 60% 80% 100%
of the meaning expressed by the current model unit.
-
This can be a ratio scale and even in 2001 there was evidence the assessors
associated percentages with the various choices (All, Most, Some,...),
but probably not using equal intervals.
-
Defining coverage (c) ranging from 0..1
= fraction of model's meaning expressed by the peer (see the revised SEE coverage question above)
-
Creating a measure of brevity (b) ranging from 0..1
= if actual peer length > predefined target length then
    b = 0
  else
    b = (predefined target length - actual peer length) / predefined target length
- Length to be measured in words (number of whitespace-delimited strings)
-
Handling (demoting) summaries larger than the target complicates the scaling
and, more importantly, allows systems to try to boost their composite score
by losing on brevity but perhaps gaining in coverage. We're not interested
in encouraging this in DUC 2002, so we don't suggest enabling it.
-
We've chosen to assume that groups will be near the target length or below
it. If summaries significantly longer than the target are submitted, our
proposal would be to truncate them.
-
We measure against the predefined target lengths (200, 100, 50, 10) rather
than abolishing them and using the composite measure to compare all summaries
to the largest (200) because the 4 target sizes are just convenient ways
of designating 4 significantly different sorts of summaries, different
enough that comparing one to another would not make sense.
-
The measure rewards equal compression ratios equally: summarizing
a 50-word target in 25 words is as valuable as summarizing a 400-word
target in 200 words. One could alternatively argue for a measure that
rewards brevity proportional to the absolute number of words saved, but
again, we are treating the different target sizes as defining different
categories of summaries so cross-category comparison will not be
appropriate anyway.
-
Letting the new composite measure X (yet to be named) be a weighted arithmetic
mean of coverage and brevity, ranging from 0..1:
X = a*c + (1-a)*b where a ranges from 0..1 and controls the
relative importance of coverage and brevity
-
Weighting is needed because different applications/users will likely prefer
different tradeoffs between coverage and brevity.
-
The use of the geometric and harmonic means is not advisable because they
are relatively insensitive in the range of small values we expect for
coverage and/or brevity (including 0).
-
For reporting purposes NIST will use two settings of a:
a = 1 (brevity doesn't matter, continuity with DUC 2001)
a = 2/3 (coverage is twice as important as brevity)
-
Some groups may not want to work on creating summaries shorter than the
targets and should not be penalized. For them a = 1 makes sense.
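The definitions above can be sketched in a few lines of Python. This is only an illustration, not NIST's scoring code: the function names (word_count, brevity, composite) are made up for this example, and the coverage value is assumed to have been averaged from the assessors' percentage judgments beforehand.

```python
def word_count(text: str) -> int:
    """Length in words: the number of whitespace-delimited strings."""
    return len(text.split())


def brevity(peer_length: int, target_length: int) -> float:
    """Brevity b in 0..1; peers longer than the target get b = 0."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if peer_length > target_length:
        return 0.0
    return (target_length - peer_length) / target_length


def composite(coverage: float, brevity_score: float, a: float) -> float:
    """Weighted arithmetic mean X = a*c + (1-a)*b, with a in 0..1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in 0..1")
    return a * coverage + (1.0 - a) * brevity_score


if __name__ == "__main__":
    # Hypothetical peer: 40 words against a 50-word target, coverage 0.6.
    c = 0.6
    b = brevity(40, 50)
    for a in (1.0, 2 / 3):  # the two reporting settings of a
        print(f"a = {a:.3f}  X = {composite(c, b, a):.3f}")
```

For this hypothetical peer, b = (50 - 40)/50 = 0.2. With a = 1 the score equals the coverage alone (0.6); with a = 2/3 it is roughly 0.467, since the small brevity value pulls the weighted mean below the coverage.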