QA Evaluation
Used human judges to assess correctness of response strings
- document provides context for answer
- NIST assessors trained for this task
- 3 independent assessments per question
- assessor judgments do differ
- final judgment set used the adjudicated majority opinion…
- … but found that such an expensive judgment set is not necessary
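
A minimal sketch of how three independent assessments per question could be combined into one final judgment by majority vote, with a fallback for cases that have no majority. The label names ("correct", "incorrect"), the function names, and the sample data are hypothetical illustrations, not the procedure NIST actually used; the slide does not specify how adjudication was carried out.

```python
from collections import Counter

def majority_judgment(judgments, adjudicate=None):
    """Return the label given by a strict majority of assessors.

    `adjudicate` is a hypothetical callback used only when no label
    has a strict majority (for example, with more than two labels).
    """
    label, votes = Counter(judgments).most_common(1)[0]
    if votes * 2 > len(judgments):
        return label
    if adjudicate is None:
        raise ValueError("no majority and no adjudicator supplied")
    return adjudicate(judgments)

def disagreement_rate(all_judgments):
    """Fraction of questions on which the assessors were not unanimous."""
    differing = sum(1 for j in all_judgments if len(set(j)) > 1)
    return differing / len(all_judgments)

# Hypothetical data: one tuple of three assessor labels per response string.
sample = [
    ("correct", "correct", "correct"),
    ("correct", "incorrect", "correct"),
    ("incorrect", "incorrect", "incorrect"),
    ("incorrect", "correct", "incorrect"),
]

final = [majority_judgment(j) for j in sample]
rate = disagreement_rate(sample)
```

With two labels and three assessors a strict majority always exists, so the adjudication callback only matters if a third label or an even number of assessors is introduced.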