Information Retrieval Experiment

IRE Information Retrieval Experiment Simulation, and simulation experiments chapter Michael D. Heine Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. 188 Simulation, and simulation experiments goal of local self-sufficiency in libraries in favour of both (a) systematic stock relegation and (b) heavier reliance on both local `closed access' collections and more remote regional and national collections. Example 3 (The `logical surface' and `document weighting surface' of a set of terms) The essence of information retrieval is that a person, in recognition of an information need, perceives a set of document attributes that in hisjudgement best distinguishes the documents relevant to that need from other documents. Suppose our interest is in the overall sensitivity of the Recall-Precision outcome of retrieval to (1) the choice of logical expression embodying a given set of N attributes (e.g. terms), and (2) the choice of (a) document weighting function, and (b) threshold value, for a set of N terms. We shall approach the problems using the system representation of Swets, as extended by the author18, and note that the general equivalence between the two retrieval strategies just mentioned was first pointed out by Angione1 9. We note also that the problems identified are subproblems of broader problems in which the identity of the set of N terms is allowed to vary. Suppose, for illustration, we take the value of N to be 4. (So subsequent expressions such as 2[OCRerr] can be generalized to 2N[OCRerr]) The four terms comprising the query are denoted by Ti, T2, T3 and T4; the query itself, construed as such as set, by Q. The distinct elementary logical conjuncts of these terms are 2[OCRerr] in number, examples of same being: TlAT2AT3AT4 or TlA[OCRerr]T2AT3A[OCRerr]T4 Here,' A `denotes conjunction and ` [OCRerr]` denotes negation. These 16 elementary conjuncts may be disjoined (i.e. ORed) together in any combination. Accordingly, for 4 search terms there are exactly16c0 + `6c1 + `6c2 + + 16C16, or 216=65536 distinct boolean expressions with which an enquirer may probe a database. The elementary conjunct in which all search terms are negated, namely [OCRerr]Tl A [OCRerr]T2 A [OCRerr]T3 A [OCRerr]T4, is unlikely to be employed in practice, so that this total might reasonably be modified to 2(24-1) Rejection of particular disjunctions of the elementary conjuncts other than the all- negated one might also be reasonable, but experimentally obtained evidence of user behaviour in this regard would be needed tojustify particular choices. In the absence of such evidence it seems reasonable to proceed without arbitrary rejection of any of the possible boolean expressions that might be used[OCRerr]ther than those involving the all-negated elementary conjunct. Our interest is in the probability distribution that this set of search expressions defines over the Recall-Precision `area' (i.e. over the area (0,1) x (0,1)). This will be the distribution on Recall-Precision outcomes when the form of the boolean expression is chosen arbitrarily by the enquirer, but the component terms of the expression are fixed. The surface will in general be specific to a given instance of information need (defined objectively as a partitioning of the data base) and a given query set Q. Before taking the above further, we examine the second problem. A document weighting function (DWF) acts so as to order, or partially order, the elementary conjuncts that we have described. This is so since a DWF serves to map the values of Q[OCRerr] Td, where Td denotes the set of terms attached j