Cranfield Tradition
Note the emphasis on comparative evaluation!
- the absolute score of an effectiveness measure is not meaningful
  - the absolute score changes when the assessor changes
  - query variability is not accounted for
  - the impact of collection size, etc. is not accounted for
  - the theoretical maximum of 1.0 for both recall & precision is not obtainable by humans
- evaluation results are only comparable when they come from the same collection
  - a subset of a collection is a different collection
  - direct comparison of scores from two different TREC collections is invalid
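
To make the comparative point concrete, here is a minimal sketch of a paired comparison: two systems are scored on the same topics against the same relevance judgments, and only the per-topic differences are interpreted. The function names (`average_precision`, `compare_systems`), the choice of average precision as the measure, and the toy topics, documents, and runs are all assumptions made for illustration; they are not taken from any TREC collection or tool.

```python
from statistics import mean


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0


def compare_systems(run_a, run_b, qrels):
    """Per-topic AP difference (A minus B) on the same topics and judgments."""
    diffs = {}
    for topic, relevant in qrels.items():
        ap_a = average_precision(run_a[topic], relevant)
        ap_b = average_precision(run_b[topic], relevant)
        diffs[topic] = ap_a - ap_b
    return diffs


# Hypothetical toy collection: two topics, three documents.
qrels = {"t1": {"d1", "d3"}, "t2": {"d2"}}
run_a = {"t1": ["d1", "d3", "d2"], "t2": ["d2", "d1", "d3"]}
run_b = {"t1": ["d2", "d1", "d3"], "t2": ["d1", "d2", "d3"]}

diffs = compare_systems(run_a, run_b, qrels)
mean_diff = mean(diffs.values())  # the sign and size of this difference is what gets compared

for topic, diff in sorted(diffs.items()):
    print(f"{topic}: AP(A) - AP(B) = {diff:+.3f}")
print(f"mean difference = {mean_diff:+.3f}")
```

Because both runs share the same topics, documents, and judgments, the differences are paired and can be read as "A beats B on this collection". The underlying absolute AP values would shift under a different assessor or a different collection, which is why they are not reported on their own here.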