Information Retrieval Experiment
Karen Sparck Jones (editor)
Butterworth & Company

Simulation, and simulation experiments
Michael D. Heine
that we created, and if important should be tested by experiment, the arbiter
of truth.
10.2 Examples of simulation models in information retrieval studies
In order to demonstrate the variety of 'style' of the simulation approach
in information retrieval, we describe three examples of simulation models.
Except for the first example, no details are given as to how the
models can be implemented on a computer, i.e. expressed as a sequence of
instructions. The first example serves to show how a widespread program
package, the Statistical Package for the Social Sciences (SPSS), can be used
for simple simulation purposes. The second example is a paraphrase of
Salton's treatment of Morse's browsing model (Morse11, Salton12). The
third example represents a novel extension of the model put forward by
Swets13,14, interpreted in a discrete formalism. The examples relate to three
very different areas in information retrieval: the speed with which documents
are supplied from a library network (through some given library); the
enhancement of 'browsing' in a collection, achieved by relegating little-used
material from it; and the distribution of the effectiveness (expressed as a pair
of Recall-Precision values) of Boolean search expressions input to a database,
when the terms of which the search expression is made up are given.
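As a reminder of the effectiveness measure mentioned for the third example, the sketch below (in Python, with purely illustrative set names) shows how a single Recall-Precision pair is obtained from one search result; it is not part of the models described in this section.

def recall_precision(retrieved, relevant):
    """Return the (Recall, Precision) pair for one search.

    retrieved -- set of document identifiers returned by the search
    relevant  -- set of document identifiers judged relevant to the request
    """
    hits = len(retrieved & relevant)              # relevant documents actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: a Boolean search retrieves 4 documents, 3 of them relevant,
# out of 6 documents relevant to the request in the whole database.
print(recall_precision({1, 2, 3, 4}, {2, 3, 4, 8, 9, 10}))   # (0.5, 0.75)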
Example 1 (use of SPSS as a simple simulation tool)
Orr has suggested the possibility of systematically measuring the speed of
supply of documents through a given library local to the user, where the
library is (as is usual) connected to one or more other libraries which can
supply documents not available locally15. Each item in a sample of
documents, allegedly a random sample of documents needed by the clients of
a given library, is assigned a 'delivery time'. This is the time taken to supply
the item, whether from a library local to the user or from a 'connected'
library. The delivery time is in fact a label for an interval into which the
actual time taken is placed, the intervals being approximately (10^(n-1), 10^n)
minutes, n = 1, 2, 3, 4, 5. (It is considered that these intervals correspond
reasonably closely to our subjective notions of document delivery time, which
a straight arithmetic scale does not.) Orr's approach is especially interesting
in that (a) it explicitly treats document delivery time as an indicator of library
effectiveness, (b) it gives a measure of overall effectiveness unbiased by an
existing pattern of demand (as distinct from need), and (c) it measures not
the effectiveness of a library 'in isolation' but its effectiveness contingent on
the strength of its connections to other document sources and the extent of
those sources. The principal difficulties in applying the method appear to be
those of identifying a convincing sample design strategy, and of
accommodating the substitutability of information demand into the method.
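A minimal sketch of the scoring step just described is given below (in Python; the function name and the clamping of out-of-range times are illustrative assumptions, not part of Orr's specification). It maps a raw delivery time in minutes to one of the five interval labels on the logarithmic scale assumed above.

import math

def delivery_time_label(minutes):
    # Map a raw delivery time (in minutes) to an interval label n,
    # where interval n covers roughly (10**(n-1), 10**n] minutes, n = 1..5.
    if minutes <= 10:
        return 1
    n = math.ceil(math.log10(minutes))   # smallest n with minutes <= 10**n
    return min(n, 5)                     # times beyond 10**5 minutes are labelled 5

# Example: 7 minutes -> 1, 90 minutes -> 2, three days (4320 minutes) -> 4
print([delivery_time_label(t) for t in (7, 90, 4320)])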
Denote document delivery time by TG (so that the possible values of TG
are 1, 2, 3, 4 and 5), and define a new variable as 125 - 25TG. The mean value
of the new variable is known as the 'Capability Index' of the library
(contingent on a specified backup system), as defined by Orr, and is denoted