Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Chapter: Simulation, and simulation experiments
Michael D. Heine
Examples of simulation models in information retrieval studies
system follows a policy of relegating items from it when the rate of usage of
such items falls below some threshold value, and (2) usage is of a 'browsing'
nature. By 'browsing' we mean a process whereby the user identifies relevant
documents by examining documents (or records) chosen randomly (uniformly)
from the collection; at least, the simulation represents what we term browsing
in this manner. (Relegation of documents could in practice be undertaken
using an 'age of document' criterion, say, if relevant items (i.e. items used)
were found to have a mean age differing significantly from the mean age of
non-relevant items.) Browsing usage so defined is not distinguished from
usage of other kinds, which of course constitutes a severe limitation of the
model. The collection that 'remains' after less-useful documents have been
relegated is referred to as a 'reconcentrated' collection. Our interest is in the
enhancement to browsing achieved by such relegation.
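As a minimal illustration of the kind of simulation just described (our own
sketch, not drawn from the studies cited here, with invented names and
parameter values), the following Python fragment browses a collection
uniformly at random, accumulates usage counts, and then relegates items
whose recorded usage falls below a threshold, leaving a 'reconcentrated'
collection.

import random

def simulate_browsing(collection_size, relevant, examinations, threshold, seed=0):
    """Illustrative sketch only: uniform-random browsing followed by relegation.
    'relevant' is the set of record ids the (hypothetical) user would actually use;
    a raw usage count stands in for a usage rate."""
    rng = random.Random(seed)
    usage = [0] * collection_size
    for _ in range(examinations):
        r = rng.randrange(collection_size)   # browsing: uniform random choice
        if r in relevant:                    # usage is recorded only for used items
            usage[r] += 1
    # relegate items whose usage falls below the threshold
    reconcentrated = [r for r in range(collection_size) if usage[r] >= threshold]
    return reconcentrated

relevant = set(random.Random(1).sample(range(10000), 200))   # hypothetical data
core = simulate_browsing(10000, relevant, examinations=50000, threshold=1)
print(len(core), "records retained in the reconcentrated collection")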
It was shown by Morse that the probability of identifying a relevant item,
placed at random in a collection or database, is:
P = 1 - exp(-δt/N)
where t is the time taken, δ is the search rate, N is the size of the
database, and the enquirer searches randomly. (The analogy of most interest
in information retrieval work in its narrower sense is perhaps that in which the
relevant items are placed at random in a retrieved set of records, the size of
which might be assumed to be proportional to the size of the parent file.)
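The exponential form can be checked informally by simulation. The following
Python sketch (ours, with hypothetical values; delta here simply stands for the
search rate δ of the formula) samples records uniformly at random and compares
the observed probability of encountering a single randomly placed relevant
record with 1 - exp(-δt/N).

import math
import random

def simulated_probability(N, delta, t, trials=5000, seed=0):
    """Sketch only: place one relevant record at random among N records, browse
    uniformly at random for delta*t examinations, and return the observed
    frequency of encountering it at least once."""
    rng = random.Random(seed)
    examinations = int(delta * t)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(N)
        if any(rng.randrange(N) == target for _ in range(examinations)):
            hits += 1
    return hits / trials

N, delta, t = 1000, 50.0, 6.0                   # hypothetical values
print(simulated_probability(N, delta, t))       # observed frequency
print(1 - math.exp(-delta * t / N))             # Morse's formula, about 0.26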
Suppose that the collection or database is divided into a more-relevant section
and a less-relevant section much as the MEDLARS database is divided into
the MEDLINE file and a set of BACKFILEs; and denote the estimated
mean numbers of items of relevance in the whole collection and the
reconcentrated collection by E and E_m. Also denote the mean numbers of
relevant items identified by searching the whole collection and the
reconcentrated collection for a time t by S(t) and S_m(t). Then:
S(t) = E(1 - exp(-δt/N))
and
S_m(t) = E_m(1 - exp(-δt/xN))
where xN is the size of the reconcentrated collection. Using simplifying
assumptions it can then be deduced that the relationship between E, E_m and
x is:
E_m = xE(1 + ln(1/x))
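(A small check we add here, not in the original: writing g(x) = x(1 + ln(1/x))
for 0 < x <= 1, we have g(1) = 1 and g'(x) = ln(1/x) >= 0 on this interval, so g
increases towards 1 and g(x) < 1 whenever x < 1. This is what licenses the
remark below that the Recall ratio E_m/E is necessarily less than 1.)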
The effectiveness of the subdivided collection, i.e. the effectiveness of
choosing a value for x (other things being equal) may then be interpreted as
either (a) the ratio S_m(t)/S(t) (the ratio of the numbers of items obtained in a
given time from the reconcentrated collection and the main collection), or
(b) the ratio of the Recall values, E_m/E (which is necessarily less than 1).
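To illustrate how the two measures behave, the following sketch (again ours,
with hypothetical values for x, δ, t and N) evaluates both ratios directly from
the formulas above; it shows, for instance, that the browsing enhancement
S_m(t)/S(t) can exceed 1 even while the Recall ratio E_m/E stays below 1.

import math

def effectiveness_ratios(x, delta, t, N, E=1.0):
    """Illustrative only: evaluate measures (a) and (b) for chosen values,
    using E_m = x*E*(1 + ln(1/x)) and Morse's exponential search model."""
    E_m = x * E * (1 + math.log(1 / x))
    S = E * (1 - math.exp(-delta * t / N))
    S_m = E_m * (1 - math.exp(-delta * t / (x * N)))
    return S_m / S, E_m / E      # (a) and (b) respectively

for x in (0.1, 0.25, 0.5):       # hypothetical reconcentration fractions
    a, b = effectiveness_ratios(x, delta=50.0, t=60.0, N=10000)
    print(f"x={x}: S_m(t)/S(t)={a:.2f}, E_m/E={b:.2f}")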
This type of model appears to be appropriate to both the problem of
optimum online file size, and the problem of optimum local library size: each
is an analogue of the other. The latter problem (discussed for example by
Gore16, or the United Kingdom University Grants Committee17) has
received particular attention in recent years in the area of academic
librarianship with the enforced abandonment, for economic reasons, of the