Grammaticality across all summaries
Most scores relatively high
System score range very wide
Medians/means:
Baselines < Systems < Humans
But why are baselines (extractions) less than perfect?
Notches in box plots indicate approximate 95% confidence
intervals around the median if and only if:
- the sample is large (> 30), or
- the sample has an approximate normal