IRE
Information Retrieval Experiment
Simulation, and simulation experiments
chapter
Michael D. Heine
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
188 Simulation, and simulation experiments
goal of local self-sufficiency in libraries in favour of both (a) systematic stock
relegation and (b) heavier reliance on both local `closed access' collections
and more remote regional and national collections.
Example 3 (The `logical surface' and `document weighting surface' of a set of
terms)
The essence of information retrieval is that a person, in recognition of an
information need, perceives a set of document attributes that in hisjudgement
best distinguishes the documents relevant to that need from other documents.
Suppose our interest is in the overall sensitivity of the Recall-Precision
outcome of retrieval to (1) the choice of logical expression embodying a given
set of N attributes (e.g. terms), and (2) the choice of (a) document weighting
function, and (b) threshold value, for a set of N terms. We shall approach the
problems using the system representation of Swets, as extended by the
author18, and note that the general equivalence between the two retrieval
strategies just mentioned was first pointed out by Angione1 9. We note also
that the problems identified are subproblems of broader problems in which
the identity of the set of N terms is allowed to vary.
Suppose, for illustration, we take the value of N to be 4. (So subsequent
expressions such as 2[OCRerr] can be generalized to 2N[OCRerr]) The four terms comprising
the query are denoted by Ti, T2, T3 and T4; the query itself, construed as
such as set, by Q. The distinct elementary logical conjuncts of these terms are
2[OCRerr] in number, examples of same being:
TlAT2AT3AT4 or TlA[OCRerr]T2AT3A[OCRerr]T4
Here,' A `denotes conjunction and ` [OCRerr]` denotes negation. These 16 elementary
conjuncts may be disjoined (i.e. ORed) together in any combination.
Accordingly, for 4 search terms there are exactly16c0 + `6c1 + `6c2 + +
16C16, or 216=65536 distinct boolean expressions with which an enquirer
may probe a database. The elementary conjunct in which all search terms are
negated, namely [OCRerr]Tl A [OCRerr]T2 A [OCRerr]T3 A [OCRerr]T4, is unlikely to be employed in
practice, so that this total might reasonably be modified to 2(24-1) Rejection
of particular disjunctions of the elementary conjuncts other than the all-
negated one might also be reasonable, but experimentally obtained evidence
of user behaviour in this regard would be needed tojustify particular choices.
In the absence of such evidence it seems reasonable to proceed without
arbitrary rejection of any of the possible boolean expressions that might be
used[OCRerr]ther than those involving the all-negated elementary conjunct. Our
interest is in the probability distribution that this set of search expressions
defines over the Recall-Precision `area' (i.e. over the area (0,1) x (0,1)). This
will be the distribution on Recall-Precision outcomes when the form of the
boolean expression is chosen arbitrarily by the enquirer, but the component
terms of the expression are fixed. The surface will in general be specific to a
given instance of information need (defined objectively as a partitioning of
the data base) and a given query set Q.
Before taking the above further, we examine the second problem. A
document weighting function (DWF) acts so as to order, or partially order,
the elementary conjuncts that we have described. This is so since a DWF
serves to map the values of Q[OCRerr] Td, where Td denotes the set of terms attached
j