Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Chapter: Simulation, and simulation experiments
Michael D. Heine
Examples of simulation models in information retrieval studies
system follows a policy of relegating items from it when the rate of usage of
such items falls below some threshold value, and (2) usage is of a 'browsing'
nature. By 'browsing' we mean a process whereby the user identifies relevant
documents by examining documents (or records) chosen randomly (uniformly)
from the collection; at least, the simulation represents what we term browsing
in this manner. (Relegation of documents could in practice be undertaken
using an 'age of document' criterion, say, if relevant items (i.e. items used)
were found to have a mean age differing significantly from the mean age of
non-relevant items.) Browsing usage so defined is not distinguished from
usage of other kinds, which of course constitutes a severe limitation of the
model. The collection that 'remains' after less-useful documents have been
relegated is referred to as a 'reconcentrated' collection. Our interest is in the
enhancement to browsing achieved by such relegation.
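As a minimal illustration of the kind of simulation just described (our own
sketch, not drawn from the studies cited here, with invented names and
parameter values), the following Python fragment browses a collection
uniformly at random, accumulates usage counts, and then relegates items
whose recorded usage falls below a threshold, leaving a 'reconcentrated'
collection.

import random

def simulate_browsing(collection_size, relevant, examinations, threshold, seed=0):
    """Illustrative sketch only: uniform-random browsing followed by relegation.
    'relevant' is the set of record ids the (hypothetical) user would actually use;
    a raw usage count stands in for a usage rate."""
    rng = random.Random(seed)
    usage = [0] * collection_size
    for _ in range(examinations):
        r = rng.randrange(collection_size)   # browsing: uniform random choice
        if r in relevant:                    # usage is recorded only for used items
            usage[r] += 1
    # relegate items whose usage falls below the threshold
    reconcentrated = [r for r in range(collection_size) if usage[r] >= threshold]
    return reconcentrated

relevant = set(random.Random(1).sample(range(10000), 200))   # hypothetical data
core = simulate_browsing(10000, relevant, examinations=50000, threshold=1)
print(len(core), "records retained in the reconcentrated collection")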
It was shown by Morse that the probability of identifying a relevant item,
placed at random in a collection or database, is:
P = 1 - exp(-δt/N)
where t is the time taken, δ is the search rate, N is the size of the
database, and the enquirer searches randomly. (The analogy of most interest
in information retrieval work in its narrower sense is perhaps that in which the
relevant items are placed at random in a retrieved set of records, the size of
which might be assumed to be proportional to the size of the parent file.)
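The exponential form can be checked informally by simulation. The following
Python sketch (ours, with hypothetical values; delta here simply stands for the
search rate δ of the formula) samples records uniformly at random and compares
the observed probability of encountering a single randomly placed relevant
record with 1 - exp(-δt/N).

import math
import random

def simulated_probability(N, delta, t, trials=5000, seed=0):
    """Sketch only: place one relevant record at random among N records, browse
    uniformly at random for delta*t examinations, and return the observed
    frequency of encountering it at least once."""
    rng = random.Random(seed)
    examinations = int(delta * t)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(N)
        if any(rng.randrange(N) == target for _ in range(examinations)):
            hits += 1
    return hits / trials

N, delta, t = 1000, 50.0, 6.0                   # hypothetical values
print(simulated_probability(N, delta, t))       # observed frequency
print(1 - math.exp(-delta * t / N))             # Morse's formula, about 0.26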
Suppose that the collection or database is divided into a more-relevant section
and a less-relevant section much as the MEDLARS database is divided into
the MEDLINE file and a set of BACKFILEs; and denote the estimated
mean numbers of items of relevance in the whole collection and the
reconcentrated collection by E and E_m. Also denote the mean numbers of
relevant items identified by searching the whole collection and the
reconcentrated collection for a time t by S(t) and S_m(t). Then:
S(t) = E(1 - exp(-δt/N))
and
S_m(t) = E_m(1 - exp(-δt/xN))
where xN is the size of the reconcentrated collection. Using simplifying
assumptions it can then be deduced that the relationship between E, E_m and
x is:
E_m = xE(1 + ln(1/x))
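(A small check we add here, not in the original: writing g(x) = x(1 + ln(1/x))
for 0 < x <= 1, we have g(1) = 1 and g'(x) = ln(1/x) >= 0 on this interval, so g
increases towards 1 and g(x) < 1 whenever x < 1. This is what licenses the
remark below that the Recall ratio E_m/E is necessarily less than 1.)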
The effectiveness of the subdivided collection, i.e. the effectiveness of
choosing a value for x (other things being equal) may then be interpreted as
either (a) the ratio S_m(t)/S(t) (the ratio of the numbers of items obtained in a
given time from the reconcentrated collection and the main collection), or
(b) the ratio of the Recall values, E_m/E (which is necessarily less than 1).
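To illustrate how the two measures behave, the following sketch (again ours,
with hypothetical values for x, δ, t and N) evaluates both ratios directly from
the formulas above; it shows, for instance, that the browsing enhancement
S_m(t)/S(t) can exceed 1 even while the Recall ratio E_m/E stays below 1.

import math

def effectiveness_ratios(x, delta, t, N, E=1.0):
    """Illustrative only: evaluate measures (a) and (b) for chosen values,
    using E_m = x*E*(1 + ln(1/x)) and Morse's exponential search model."""
    E_m = x * E * (1 + math.log(1 / x))
    S = E * (1 - math.exp(-delta * t / N))
    S_m = E_m * (1 - math.exp(-delta * t / (x * N)))
    return S_m / S, E_m / E      # (a) and (b) respectively

for x in (0.1, 0.25, 0.5):       # hypothetical reconcentration fractions
    a, b = effectiveness_ratios(x, delta=50.0, t=60.0, N=10000)
    print(f"x={x}: S_m(t)/S(t)={a:.2f}, E_m/E={b:.2f}")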
This type of model appears to be appropriate to both the problem of
optimum online file size, and the problem of optimum local library size: each
is an analogue of the other. The latter problem (discussed for example by
Gore16, or the United Kingdom University Grants Committee17) has
received particular attention in recent years in the area of academic
librarianship with the enforced abandonment, for economic reasons, of the