IRE
Information Retrieval Experiment
Simulation, and simulation experiments
chapter
Michael D. Heine
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Examples of simulation models in information retrieval studies 191
(6) Plot the resulting surface, find the centre of mass, etc.
The algorithm can be modified to delete (R, P) points that arise from
combinations involving the all-negated elementary conjunct, as mentioned
previously, in which case the grand total changes. We note also that it is
doubtful if the algorithm could easily be implemented for N in excess of 4,
due to the combinatorial explosion in number of distinct search expressions.
(For N=5 this number is 232=4.3 x 10[OCRerr].) So what the algorithm produces is
the probability distribution on the R-P outcome space when a user of a
database selects a search expression arbitrarily, for some given set of search
terms.
We turn now to the second problem. Suppose that we restrict attention in
the first place to the set of DWFs that each map the 16 elementary conjuncts
to 16 distinct real numbers. These DWFs can be distinguished and classified
by the rankings (permutations) of the elementary conjuncts that they effect,
as previously mentioned, there being 16! classes for DWFs of this type. Our
interest is in the hypothetical situation of an enquirer (1) choosing a DWF
randomly from one of these classes, and (2) choosing a threshold value
randomly from the values 1,2,3,. . . , 16. An algorithm to generate the
probability distribution over the Recall-Precision outcome space for this
situation (for this species of DWF, for a given instance of information need,
and for a given query qua set of terms) is as follows:
(1) Define a permutation of the elementary conjuncts. Set J equal to 1. Read
G.
(2) Define R= [OCRerr] r[OCRerr], F= Z[OCRerr]1=6[OCRerr]f Infer P
(3) Put the resulting (R,P) co-ordinate into a cell of a grid defined over
(0,1) x (0,1).
(4) Increment J by 1. If J< 2N then go to step 2 else define a new permutation
and go to step 2. (If no new permutation is possible go to the next step.)
(5) Divide each total of co-ordinates in the cell grid by the grand total of
(R,P) points, i.e. by 2N(2N!).
(6) Plot the resulting surface, find the centre of mass, etc.
The surface could be termed the `document weighting surface' of a set of
terms. As noted before, permutations that throw the all-negated elementary
conjunct to rank positions other than 1 might be discounted. Also, we could
ignore (R,P) points arising from threshold values of 1 if we wished since,
unless r[OCRerr] 0, this relates to retrieval of the entire database. (It is recognized
as unlikely that an enquirer will require a Recall value of 1.) More
complicated algorithms might be identified that produce the Recall-Precision
probability surface for DWFs that only weakly order the elementary
conjuncts.
The three examples we have given are intended to represent widely
different approaches to simulation in information retrieval study. The first
was concerned with the `random number' simulation technique serving in
that case to show how the effects of choosing different policy options affecting
document delivery speed in 6ne type of information environment could be
predicted and compared. The second example served to emphasize the
necessity of imposing clear definitions on everyday words[OCRerr]in that case
`browsing'-for simulation to be undertaken at all, and perhaps also