IRE
Information Retrieval Experiment
Simulation, and simulation experiments
chapter
Michael D. Heine
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Examples of simulation models in information retrieval studies 189
to a document, to the real numbers. These values are each identifiable with
a set of elementary conjuncts of the terms that make up Q. This mapping
could be done by, for example, summing the non-negated terms that are
attached in the document-the Cranfield `co-ordination level' function[OCRerr]r
by forming the product of, or summing, the specificities of the non-negated
terms attached to the document. If the DWF is such that more than one
elementary conjunct maps to the same numeric value (which is certainly true
for the co-ordination level function) the ordering of elementary conjuncts is
a partial ordering. The outcome values of DWFs are of course the `document
weights' of interest, each document being mapped to exactly one value. (The
concept of `term weight' is a superfluous one and in the author's view better
avoided, since it obscures the essential relationship of interest: the association
of each document with one or more elementary conjuncts.)
Our interest is thus in relating each of the retrieval strategies to values of
Recall and Precision as the latter is determined by identification of selected
elementary conjuncts of query terms. The ways in which this can be done
become apparent if we specify all the elementary conjuncts and associate
each such conjunct with a pair of probabilities: the probability that a relevant
document is mapped by the function Q[OCRerr] Td to that conjunct, and the
probability that a non-relevant document is mapped to that conjunct. (We
should, strictly, refer to two functions defined by the expression Qc\ Td, since
the domain differs in the two cases.) For N=4 the list is as in Table 10.1
where [OCRerr] ,[OCRerr] =1= [OCRerr]f. The Swetsian probability distributions are defined once
each conjunct has been mapped to a real number by a DWF, the numbers not
necessarily being distinct. (This is, at least, one interpretation of Swet's
contribution. In the experimental results re-analysed by Swets, specific data
sets such as the above were in fact confounded before the distributions were
identified.) These distributions, as a pair, define a set of Recall-Precision co-
ordinates, i.e. a graph, if a threshold value is allowed to explore the various
outcome values and a document is retrieved only when its weight exceeds
that threshold value. For a threshold value of T, the Recall and Fallout values
are:
16 16
R = and F=Zf
i=T
with the Precision value following from P = G/[G + (1 - G)(F/R)]. Such graphs
are defined by a particular instance of information need (evidenced through
a partitioned database), a particular query (defined as a set of terms), and a
DWF. Since the elementary conjuncts can be ranked (permuted) in (2[OCRerr])!
ways, there are (2[OCRerr])! subsets of DWFs distinguishable through the rankings
of elementary conjuncts that they effect, of those DWFs that yield 2[OCRerr] distinct
outcome values. DWFs can also be classified by the Recall-Precision graphs
that they effect, for a particular need and query. Each graph possesses at
most 2[OCRerr] distinct R-P co-ordinates. (One says `at most' since if `(0,0)' values
of (rj ,f) are present, the cumulative values of R and F will be unaffected by
incrementation of the threshold past such pairs and so the R-P graph will
lose one co-ordinate for each such pair. As N increases, the number of such
probability pairs will increase.) The number of distinct R-P graphs is at most
(2[OCRerr])! Possibly this is the exact number, but since Precision depends only on