experimenter to disregard the problems of implementing his algorithms as components of a full-scale operational system, and concentrate his programming effort on achieving efficiency in his test programs, and thus a quick turnaround for each stage of the experiment. One often encounters the view (or faith) that, if the merits of the logic of the retrieval technique warrant use in a large-scale, real-life system, ways will certainly be found to create an acceptable implementation, either through ingenious programming or by devising new hardware. This seems not unreasonable when one considers the history of computing.

I should like to illustrate the nature of information retrieval laboratory programs with a simple example. Although this should not be regarded as a description of an actual program, it does draw upon ideas present in existing programs. Suppose that we wish to model a system that responds to a user's query with a ranked list of references to documents. The document ranking is to be determined by the sum of the weights of terms which are common to both query and document. For my present purpose, it is not important to specify how the weights have been derived; suffice it to say that they are available either in the document descriptions or the queries. One evaluation method, used by Sparck Jones2, involves aggregating, for all queries in the test, the numbers of relevant and non-relevant documents retrieved at each value of the matching function. (This leads to a type of microevaluation, in Rocchio's terminology32.)

An operational system would require the capability of selecting enough of the highest-ranking documents to satisfy the demands of the user, and of presenting them in the correct order: a process which must be designed with some subtlety if it is not to use excessive computing resources. A laboratory model could do the task in a very different manner: ranking documents is unnecessary because there is no user to view them. A straightforward computer program can be designed along the following lines (a sketch in code is given after the list):

(1) Set up a two-column table in core storage to be used for counting relevant and non-relevant documents (see Figure 9.3).
(2) Place all the relevance judgements in core storage.
(3) Place all the (numerically coded) document descriptions in core storage.
(4) Proceed through the file of queries, performing the following operations for each query: compare each document with the query to compute a matching value (which should be a positive whole number); this and the relevance relationship between the document and the query determine a position in the table, to which 1 is added.
(5) When the last query has been processed, the completed table can be used to calculate figures for a recall/precision plot. Note that this, and not retrieved references, is the primary output of the program.
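The following minimal Python sketch illustrates steps (1) to (5). It is an illustration under stated assumptions, not a reconstruction of any particular test program: it assumes that document descriptions are mappings from numeric term codes to integer weights (so the weights live on the document side), that queries are lists of term codes, and that the relevance judgements form a set of (query, document) identifier pairs. None of these choices is dictated by the scheme itself.

    from collections import defaultdict

    def match_value(query_terms, doc_weights):
        # Matching value: the sum of the weights of terms common to the
        # query and the document (a positive whole number, as in step (4)).
        return sum(doc_weights[t] for t in query_terms if t in doc_weights)

    def evaluate(queries, documents, relevant):
        # Step (1): a two-column table, indexed by matching value, counting
        # relevant and non-relevant documents retrieved at that value.
        table = defaultdict(lambda: [0, 0])
        # Steps (2) and (3) correspond to 'relevant' and 'documents'
        # already being held in core storage.
        for qid, qterms in queries.items():            # step (4)
            for did, dweights in documents.items():
                value = match_value(qterms, dweights)
                column = 0 if (qid, did) in relevant else 1
                table[value][column] += 1
        return table                                   # step (5)

    def recall_precision(table, total_relevant):
        # Aggregate the table from the highest matching value downwards,
        # yielding one (recall, precision) point per cut-off value.
        points, rel, nonrel = [], 0, 0
        for value in sorted(table, reverse=True):
            r, n = table[value]
            rel, nonrel = rel + r, nonrel + n
            points.append((value, rel / total_relevant, rel / (rel + nonrel)))
        return points

    # Usage: points = recall_precision(evaluate(queries, documents, relevant),
    #                                  total_relevant=len(relevant))

Note that, exactly as in the description above, no ranked list of references is ever produced: the table, and the recall/precision figures derived from it, are the program's only output.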
On most computers, the number of documents that can be handled in this way is rather limited, because all their descriptions are held in core storage while queries are processed. However, test collections usually have far fewer queries than documents: one does not often encounter experiments in the literature which use more than 250 queries. Thus, a more useful experimental computer program, which could deal with indefinitely large sets of documents, can be built by simply substituting 'query' for 'document', and vice versa, in steps (3), (4) and (5) above. We now have a program which looks structurally the same.
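Under the same illustrative assumptions as before, a sketch of the transposed program might look as follows: the (small) query file and the judgements stay in core, while documents are read past them one at a time, so the document set may be indefinitely large.

    from collections import defaultdict

    def evaluate_streaming(queries, document_stream, relevant):
        # Transposed version of 'evaluate': all queries are held in core
        # (test collections rarely exceed a few hundred), while documents
        # arrive one at a time from an external file or iterator.
        table = defaultdict(lambda: [0, 0])
        for did, dweights in document_stream:      # one document in core at once
            for qid, qterms in queries.items():
                value = sum(dweights[t] for t in qterms if t in dweights)
                column = 0 if (qid, did) in relevant else 1
                table[value][column] += 1
        return table

Only the loop order changes: the table and the recall/precision calculation are untouched, which is why substituting 'query' for 'document' in steps (3), (4) and (5) suffices.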