Information Retrieval Experiment (ed. Karen Sparck Jones), Butterworth & Company
The methodology of information retrieval experiment
Stephen E. Robertson

The documents were 1400 real documents on the subject of aeronautics, selected rather than sampled. The 221 requests were obtained by asking the authors of selected published papers ('base documents') to reconstruct the questions which originally gave rise to those papers. The experimental design was quite simple: each query was searched against every system. Since the searching part of the system was controlled by simple rules, there was no problem in replicating searches or in the order in which the systems were tried.

Measurements were made of relevance and of a number of explanatory variables. An attempt was made to obtain complete relevance judgements. The procedure adopted was as follows: students of the subject searched the entire document collection (starting with titles, but consulting the full document if there seemed any possibility of relevance) against each of the requests. Documents they selected as possibly relevant to any request were subjected to final judgement by the author/requester, together with (a) the references given in the base documents, and (b) documents retrieved by one very different kind of retrieval technique.
The analysis was chiefly directed at calculating recall and precision averages, and at relating these to the variables built into the experiment (concerning the construction of the index language) and to various explanatory variables such as exhaustivity of indexing and specificity of the language.

Medlars

The object of the Medlars test was to evaluate the existing Medlars system and to find out ways in which it could be improved. A few variables were built into the experiment, notably the form of interaction between the user and the system, so that the results obtained with different forms of interaction could be compared; but the main feature of the test was a detailed analysis of the reasons for failure.

The document collection was that currently available on the Medlars service, and consisted of about 700 000 items. 302 genuine queries were obtained by a form of stratified sampling. Because the requests were real ones, it was not possible to replicate searches with different forms of interaction between system and user; hence the comparisons in relation to this variable had to be based on different request sets.

Relevance judgements were provided by the requesters. Since in such a situation there could be no question of scanning the entire collection, the testers went to considerable effort to discover some relevant documents that had not been found by the system (such documents were necessary for the failure analysis). The sources for these documents were (a) documents already known to the requester, and (b) documents found by Medlars staff through sources other than Medlars or Index Medicus. Thus each requester was asked to judge a sample of the output from the Medlars search, together with selected documents from other sources.

After the relevance judgements had been obtained, the measurement process continued with an analysis of failures (non-relevant documents retrieved and relevant documents not retrieved). A classification of reasons for failure was devised.
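The set-based recall and precision averages used in analyses of this kind can be sketched as follows. This is a minimal illustration in Python; the function names and the representation of judgements as sets of document identifiers are my own, not taken from the test reports.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single request.

    retrieved: set of document ids returned by the system.
    relevant:  set of document ids judged relevant by the requester.
    """
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(runs):
    """Average precision and recall over a set of requests.

    runs: list of (retrieved, relevant) set pairs, one per request.
    Each request contributes equally, regardless of output size.
    """
    pairs = [precision_recall(ret, rel) for ret, rel in runs]
    n = len(pairs)
    avg_p = sum(p for p, _ in pairs) / n
    avg_r = sum(r for _, r in pairs) / n
    return avg_p, avg_r
```

Note that the failure analysis in the Medlars test corresponds to the two ways a request can score imperfectly here: precision failures (non-relevant documents retrieved) and recall failures (relevant documents not retrieved).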
Cranfield 2 and Medlars are two of the classic experiments, both playing