IRE Information Retrieval Experiment Simulation, and simulation experiments chapter Michael D. Heine Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. Examples of simulation models in information retrieval studies 189 to a document, to the real numbers. These values are each identifiable with a set of elementary conjuncts of the terms that make up Q. This mapping could be done by, for example, summing the non-negated terms that are attached in the document-the Cranfield `co-ordination level' function[OCRerr]r by forming the product of, or summing, the specificities of the non-negated terms attached to the document. If the DWF is such that more than one elementary conjunct maps to the same numeric value (which is certainly true for the co-ordination level function) the ordering of elementary conjuncts is a partial ordering. The outcome values of DWFs are of course the `document weights' of interest, each document being mapped to exactly one value. (The concept of `term weight' is a superfluous one and in the author's view better avoided, since it obscures the essential relationship of interest: the association of each document with one or more elementary conjuncts.) Our interest is thus in relating each of the retrieval strategies to values of Recall and Precision as the latter is determined by identification of selected elementary conjuncts of query terms. The ways in which this can be done become apparent if we specify all the elementary conjuncts and associate each such conjunct with a pair of probabilities: the probability that a relevant document is mapped by the function Q[OCRerr] Td to that conjunct, and the probability that a non-relevant document is mapped to that conjunct. (We should, strictly, refer to two functions defined by the expression Qc\ Td, since the domain differs in the two cases.) For N=4 the list is as in Table 10.1 where [OCRerr] ,[OCRerr] =1= [OCRerr]f. The Swetsian probability distributions are defined once each conjunct has been mapped to a real number by a DWF, the numbers not necessarily being distinct. (This is, at least, one interpretation of Swet's contribution. In the experimental results re-analysed by Swets, specific data sets such as the above were in fact confounded before the distributions were identified.) These distributions, as a pair, define a set of Recall-Precision co- ordinates, i.e. a graph, if a threshold value is allowed to explore the various outcome values and a document is retrieved only when its weight exceeds that threshold value. For a threshold value of T, the Recall and Fallout values are: 16 16 R = and F=Zf i=T with the Precision value following from P = G/[G + (1 - G)(F/R)]. Such graphs are defined by a particular instance of information need (evidenced through a partitioned database), a particular query (defined as a set of terms), and a DWF. Since the elementary conjuncts can be ranked (permuted) in (2[OCRerr])! ways, there are (2[OCRerr])! subsets of DWFs distinguishable through the rankings of elementary conjuncts that they effect, of those DWFs that yield 2[OCRerr] distinct outcome values. DWFs can also be classified by the Recall-Precision graphs that they effect, for a particular need and query. Each graph possesses at most 2[OCRerr] distinct R-P co-ordinates. (One says `at most' since if `(0,0)' values of (rj ,f) are present, the cumulative values of R and F will be unaffected by incrementation of the threshold past such pairs and so the R-P graph will lose one co-ordinate for each such pair. As N increases, the number of such probability pairs will increase.) The number of distinct R-P graphs is at most (2[OCRerr])! Possibly this is the exact number, but since Precision depends only on