Inferred Average Precision and TRECVID 2006


Inferred average precision

Researchers at Northeastern University have developed several new methods for estimating standard information retrieval (IR) measures quite accurately from surprisingly small samples of the usual TREC-style pooled judgments. One of these measures, inferred average precision (infAP), is being used to evaluate the feature task submissions in TRECVID 2006 after testing on TRECVID 2005 data demonstrated its utility. This page reports on that testing and its results.

The notion of inferred average precision arises from viewing average precision as the expected value of a random experiment: choose a relevant document at random from a ranked list and ask what the probability is of getting a relevant document at or above that rank. The probability of getting a relevant document at or above the rank corresponds to the precision at that rank, and picking a relevant document at random corresponds to averaging these precisions over all relevant documents. More background and the derivation of the actual measure can be found in the following paper:

Emine Yilmaz and Javed A. Aslam. Estimating Average Precision with
Incomplete and Imperfect Judgments. In Proceedings of the Fifteenth
ACM International Conference on Information and Knowledge Management
(CIKM), November 2006.

Here is a brief introduction to inferred average precision, courtesy of Yilmaz and Aslam.
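To make the thought experiment concrete, here is a small Python sketch (the document identifiers are made up for illustration) showing that classic AP and the expected value of the random experiment agree:

    import random

    def average_precision(ranked_docs, relevant):
        # Classic AP: mean of precision at the rank of each relevant document;
        # relevant documents that are never retrieved contribute zero.
        hits, total = 0, 0.0
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                total += hits / k
        return total / len(relevant) if relevant else 0.0

    def ap_as_expectation(ranked_docs, relevant, trials=100000, seed=0):
        # The random experiment: draw a relevant document at random and report
        # the precision at its rank (zero if it was never retrieved).
        rng = random.Random(seed)
        prec_at, hits = {}, 0
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                prec_at[doc] = hits / k
        docs = sorted(relevant)
        return sum(prec_at.get(rng.choice(docs), 0.0) for _ in range(trials)) / trials

    ranked = ["d3", "d1", "d7", "d2", "d5"]
    rel = {"d3", "d2", "d9"}                   # d9 is relevant but never retrieved
    print(average_precision(ranked, rel))      # (1/1 + 2/4) / 3 = 0.5
    print(ap_as_expectation(ranked, rel))      # converges to the same value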

What-if experiments with the TRECVID 2005 feature task

For TRECVID 2005, the submitted system results were pooled to a depth of at least 200 items in each run's ranked results, and the shots in those pools were manually judged, forming a base set of judgments for the experiments.
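A minimal sketch of the pooling step, assuming runs are available as ranked lists of shot identifiers (the data layout here is an assumption, not the actual TRECVID submission format):

    def build_pool(runs, depth=200):
        # Union of the top-`depth` shots from every submitted run for one feature;
        # only shots in this pool receive manual judgments.
        pool = set()
        for ranked_shots in runs.values():
            pool.update(ranked_shots[:depth])
        return pool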

Four other sets of judgments were created by randomly marking 20%, 40%, 60%, and 80% of the base judgments as "not judged", forming 80%, 60%, 40%, and 20% samples, respectively, of the base judgments.
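The sampled judgment sets could be produced along these lines (representing the qrels as a simple dict is an assumption about the data layout; -1 marks a judgment dropped from the sample):

    import random

    def sample_judgments(base_qrels, keep_fraction, seed=0):
        # Randomly keep `keep_fraction` of the base judgments and mark the rest
        # as "not judged" (-1).  base_qrels maps shot id -> 0/1 judgment.
        rng = random.Random(seed)
        return {shot: (j if rng.random() < keep_fraction else -1)
                for shot, j in base_qrels.items()}

    # e.g. the 60% sample: sample_judgments(base_qrels, 0.60)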

All systems that submitted results for all features in 2005 were then evaluated against the base judgments and each of the four sampled judgment sets using the definition of infAP now built into trec_eval. By that definition, infAP computed on a 100% sample of the base judgment set is identical to average precision (AP).
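For reference, here is a sketch of the infAP estimator as described in the Yilmaz and Aslam paper; it is not the trec_eval source, and the judgment encoding and the smoothing constant are assumptions:

    EPS = 1e-5  # small smoothing constant to avoid 0/0 (assumed value)

    def inferred_ap(ranked_docs, judgments):
        # judgments maps doc id -> 1 (judged relevant), 0 (judged nonrelevant),
        # or -1 (in the pool but dropped from the sample); documents absent from
        # the dict are outside the pool and treated as nonrelevant.
        n_rel = sum(1 for j in judgments.values() if j == 1)
        if n_rel == 0:
            return 0.0
        total = 0.0
        pooled_above = rel_above = nonrel_above = 0
        for k, doc in enumerate(ranked_docs, start=1):
            j = judgments.get(doc)                      # None => outside the pool
            if j == 1:
                if k == 1:
                    exp_prec = 1.0
                else:
                    # expected precision above rank k: chance a document above k
                    # is in the pool, times the smoothed fraction judged relevant
                    # among the sampled pool documents above k
                    p_pool = pooled_above / (k - 1)
                    p_rel = (rel_above + EPS) / (rel_above + nonrel_above + 2 * EPS)
                    exp_prec = 1.0 / k + ((k - 1) / k) * p_pool * p_rel
                total += exp_prec
            if j is not None:                           # document is in the pool
                pooled_above += 1
                if j == 1:
                    rel_above += 1
                elif j == 0:
                    nonrel_above += 1
        return total / n_rel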

The results based on the sampled judgments were then compared to those based on the base judgments. InfAP scoring approximates AP scoring very closely, and system rankings change very little when determined by infAP rather than AP.

Distribution of mean infAP and AP

What do the distributions of infAP run scores look like using ever smaller samples of the base judgment set?

		Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
100% sample	0.0020	0.1145	0.1460	0.1583	0.2265	0.3190
 80% sample	0.0020	0.1125	0.1460	0.1580	0.2250	0.3220
 60% sample	0.0020	0.1120	0.1470	0.1590	0.2260	0.3200
 40% sample	0.0030	0.1135	0.1440	0.1563	0.2230	0.3190
 20% sample	0.0020	0.1020	0.1460	0.1500	0.2035	0.3100

How well do the infAP scores based on ever smaller samples of the judgment set match the AP scores?

Kendall's tau

80% sample	0.9862658
60% sample	0.9871663
40% sample	0.9700546
20% sample	0.9515660

Correlation coefficient (Pearson's)

80% sample	0.9996353
60% sample	0.999618
40% sample	0.9991702
20% sample	0.995474

Residual standard error fitting infAP to AP

80% sample	0.002383
60% sample	0.002438
40% sample	0.003594
20% sample	0.008386
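
The three comparisons above (rank correlation, score correlation, and the residual error of a linear fit) can be computed from per-run mean scores with standard tools; here is a sketch using NumPy and SciPy (the original analysis may well have used other software):

    import numpy as np
    from scipy import stats

    def agreement_stats(ap_scores, infap_scores):
        # ap_scores and infap_scores are parallel lists of per-run mean scores.
        ap, inf = np.asarray(ap_scores), np.asarray(infap_scores)
        tau, _ = stats.kendalltau(ap, inf)             # rank agreement
        r, _ = stats.pearsonr(ap, inf)                 # linear correlation
        slope, intercept = np.polyfit(ap, inf, 1)      # least-squares fit of infAP on AP
        residuals = inf - (slope * ap + intercept)
        rse = np.sqrt(np.sum(residuals**2) / (len(ap) - 2))  # residual standard error
        return tau, r, rse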

Statistically significant rank swaps

For each set of judgments, a randomization test was used to detect statistically significant (p < 0.01) differences among all pairs of runs. The set of significant differences found with each sampled judgment set was then compared with that of the base set, counting the number of significantly different pairs that swap (i.e., reverse) order, are lost (can no longer be distinguished), remain the same, or are added (can now be distinguished). Here are the results:
		Swap	Lose	Keep	Add
80% sample	0	35	2018	37
60% sample	0	57	1996	36
40% sample	0	104	1949	45
20% sample	0	170	1883	73
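
The significance test itself is a standard paired randomization test over per-feature scores; a minimal sketch follows (the trial count and the exact statistic used in the original analysis are assumptions):

    import random

    def randomization_test(scores_a, scores_b, trials=10000, seed=0):
        # Paired randomization test: for each feature, randomly flip which run
        # its pair of scores is assigned to, and count how often the absolute
        # difference in mean score is at least as large as the observed one
        # (two-sided p-value).
        rng = random.Random(seed)
        n = len(scores_a)
        observed = abs(sum(scores_a) - sum(scores_b)) / n
        hits = 0
        for _ in range(trials):
            diff = sum((a - b) if rng.random() < 0.5 else (b - a)
                       for a, b in zip(scores_a, scores_b))
            if abs(diff) / n >= observed:
                hits += 1
        return hits / trials

    # two runs differ significantly at the level used above if the p-value < 0.01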
[Scatter plots: distribution of AP versus infAP for runs scored with the 80%, 60%, 40%, and 20% samples of the base judgments]

Conclusions

In order to encourage the use of generic methods for detector development, TRECVID 2006 decided to require submissions for 39 features, rather than the 10 or so required previously, but, due to a fixed budget for judgments, could promise to evaluate only 10 of the 39. Given the results of the experiments described above, however, we decided to judge 20 features using a 50% random sample of the usual pools. This allowed us to get a better estimate of system performance by averaging over 20 rather than 10 features.