IRE Information Retrieval Experiment Simulation, and simulation experiments chapter Michael D. Heine Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. 190 Simulation, and simulation experiments the ratio of Fallout to Recall, one suspects that some Recall-Precision graphs may fortuitously be identical. (Uncertainties of this nature are in fact an argument for using simulation techniques: both to bypass establishment of their truths by formal means, and to try to establish exceptions to general statements.) TABLE 10.1 Rank Elementary conjunct Probability for set of Probability for set of non. value relevant documents relevant documents a TTA T2A T3A T4 r fa b TiA T2A T3A7T4 rb fb C TiA T2A[OCRerr]T3A T4 r[OCRerr] fc d TiA T2A7T3A[OCRerr]T4 rd fd e T1A[OCRerr]T2A T3A T4 r[OCRerr] fe f T1A7T2A T3A7T4 rf ff g T1A[OCRerr]T2A7T3A T4 rg h T1A[OCRerr]T2A[OCRerr]T3A7T4 rh fh i [OCRerr]TlA T2A T3A T4 r f j 7T1A T2A T3A[OCRerr]T4 f k 7T1A T2A7T3A T4 rk fk I 7T1A T2A[OCRerr]T3A7T4 rL f m [OCRerr]A[OCRerr]T2A T3A T4 r,,, fm n 7TlA7T2A T3A7T4 r[OCRerr] 0 [OCRerr]TlA7T2A7T3A T4 r0 p [OCRerr]TlAThT2A[OCRerr]T3A7T4 We now return to the first problem. We can systematically identify all Boolean expressions by taking combinations of rows from Table 10.1, the expression in each row being ORed together. Each such combination will be associated with values of Recall and Fallout, obtained by summing the values of [OCRerr] andf in the rows, respectively. A value of Precision follows once G is specified. The pair of Recall and Precision values then contributes to the distribution of interest. For example, the boolean search expression: (Ti A T2 A T3 A T4) V (Ti A T2 A T3 A [OCRerr]T4) V (Ti A T2 A [OCRerr]T3 A ~T4) will be associated with Recall and Fallout values of R = ra + rb + rd and F= fa +[OCRerr] +fd[OCRerr] and a consequential P value. An algorithm to generate the `logical surface' of a set of terms is thus: (1) Read the 2N pairs of probability values {(r[OCRerr],j[OCRerr]i= i, 2N}, obtained from an experiment, into a 2N x 2 array. Read a value for G. (2) Define a threshold value equal to 2N, with J, an integer variable, initialized to 0. (3) Define every combination of array rows of size J+ 1. (There are 2NCj+1 such combinations.) For each combination sum the r[OCRerr] values to form R and sum thef values to form F. Infer P, and place the resulting (R, P) c[OCRerr] ordinate into a cell of a grid defined over (0,1) x (0,1). This could conveniently be a 60 x 60 grid, for example. (4) Increment J by 1, and if J< 2N return to step 2. (5) Divide each total of c[OCRerr]ordinates in the cell grid by the grand total of (R, P) co-ordinates, i.e. by 22N. I;