IRE
Information Retrieval Experiment
Simulation, and simulation experiments
chapter
Michael D. Heine
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
190 Simulation, and simulation experiments
the ratio of Fallout to Recall, one suspects that some Recall-Precision graphs
may fortuitously be identical. (Uncertainties of this nature are in fact an
argument for using simulation techniques: both to bypass establishment of
their truths by formal means, and to try to establish exceptions to general
statements.)
TABLE 10.1
Rank Elementary conjunct Probability for set of Probability for set of non.
value relevant documents relevant documents
a TTA T2A T3A T4 r fa
b TiA T2A T3A7T4 rb fb
C TiA T2A[OCRerr]T3A T4 r[OCRerr] fc
d TiA T2A7T3A[OCRerr]T4 rd fd
e T1A[OCRerr]T2A T3A T4 r[OCRerr] fe
f T1A7T2A T3A7T4 rf ff
g T1A[OCRerr]T2A7T3A T4 rg
h T1A[OCRerr]T2A[OCRerr]T3A7T4 rh fh
i [OCRerr]TlA T2A T3A T4 r f
j 7T1A T2A T3A[OCRerr]T4 f
k 7T1A T2A7T3A T4 rk fk
I 7T1A T2A[OCRerr]T3A7T4 rL f
m [OCRerr]A[OCRerr]T2A T3A T4 r,,, fm
n 7TlA7T2A T3A7T4 r[OCRerr]
0 [OCRerr]TlA7T2A7T3A T4 r0
p [OCRerr]TlAThT2A[OCRerr]T3A7T4
We now return to the first problem. We can systematically identify all
Boolean expressions by taking combinations of rows from Table 10.1, the
expression in each row being ORed together. Each such combination will be
associated with values of Recall and Fallout, obtained by summing the
values of [OCRerr] andf in the rows, respectively. A value of Precision follows once
G is specified. The pair of Recall and Precision values then contributes to the
distribution of interest. For example, the boolean search expression:
(Ti A T2 A T3 A T4) V (Ti A T2 A T3 A [OCRerr]T4) V (Ti A T2 A [OCRerr]T3 A
~T4)
will be associated with Recall and Fallout values of R = ra + rb + rd and F=
fa +[OCRerr] +fd[OCRerr] and a consequential P value. An algorithm to generate the `logical
surface' of a set of terms is thus:
(1) Read the 2N pairs of probability values {(r[OCRerr],j[OCRerr]i= i, 2N}, obtained from an
experiment, into a 2N x 2 array. Read a value for G.
(2) Define a threshold value equal to 2N, with J, an integer variable,
initialized to 0.
(3) Define every combination of array rows of size J+ 1. (There are 2NCj+1
such combinations.) For each combination sum the r[OCRerr] values to form R
and sum thef values to form F. Infer P, and place the resulting (R, P) c[OCRerr]
ordinate into a cell of a grid defined over (0,1) x (0,1). This could
conveniently be a 60 x 60 grid, for example.
(4) Increment J by 1, and if J< 2N return to step 2.
(5) Divide each total of c[OCRerr]ordinates in the cell grid by the grand total of
(R, P) co-ordinates, i.e. by 22N.
I;