Information Retrieval Experiment, edited by Karen Sparck Jones (Butterworth & Company)

Opportunities for testing with online systems
Elizabeth D. Barraclough

The comprehensive search should be assessed by measuring recall and precision. For this type of search in a real-life system one would be attempting to assess the actual performance of the system in its normal operating mode against the capabilities of the system as exploited by the experienced experimenter. The user, or his intermediary, would perform the search in the normal way using whatever interactive tools he wished, e.g. iterating through a series of formulations and prints to arrive at a satisfactory search, finally ending with a formulation deemed to be a correct expression of his needs. The experimenter would then run a broader search on the same topic, one that includes the user's formulation as a subset, and the user would be asked to assess the full output. From this the precision of the user's formulation can be established directly, and the recall estimated from the overlap between the full set and the user's subset.

The user who wishes to find a few references as an entry into a subject has very different criteria for a satisfactory search. Precision is still of interest, but recall assumes a minor role; the most important aspect is finding the references quickly and easily. For this type of search all the references would be retrieved online, so a parallel search by the experimenter is not possible.
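The precision and recall estimates described above reduce to simple set arithmetic once relevance judgements are available for the experimenter's broad output. The following is a minimal sketch of that calculation; the document identifiers and judgements are hypothetical, and the function name is invented for illustration.

```python
# Sketch of the comprehensive-search assessment: precision of the
# user's formulation, and recall estimated against the experimenter's
# broader search (which contains the user's result set as a subset).

def assess_comprehensive_search(user_set, broad_set, judged_relevant):
    """Return (precision, recall_estimate) for the user's formulation."""
    assert user_set <= broad_set, "user formulation must be a subset of the broad search"
    relevant_in_user = user_set & judged_relevant
    relevant_in_broad = broad_set & judged_relevant
    precision = len(relevant_in_user) / len(user_set)
    # Recall can only be estimated: relevant documents that the broad
    # search itself missed are invisible to the experiment.
    recall_estimate = len(relevant_in_user) / len(relevant_in_broad)
    return precision, recall_estimate

# Hypothetical example: the user retrieved 4 documents, the broad
# search retrieved 10, of which 5 were judged relevant by the user.
user = {"d1", "d2", "d3", "d4"}
broad = user | {"d5", "d6", "d7", "d8", "d9", "d10"}
relevant = {"d1", "d2", "d5", "d7", "d9"}
p, r = assess_comprehensive_search(user, broad, relevant)
# precision = 2/4 = 0.5; recall estimate = 2/5 = 0.4
```

Note that the recall figure is only relative: it assumes the broad search approximates the full set of relevant documents in the database.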
In this case analysis of the whole-session data would prove very valuable. One might wish to measure the number of references retrieved before the first relevant one, and also the number of formulations needed, with the number of relevant citations retrieved by each, before a satisfactory formulation is reached. A good example of the detailed analysis that needs to be done is the study by de Jong-Hoffman3 of a single search on the INSPEC database. It provides many ideas for points that should be investigated in a wider study.

System effectiveness

The macroevaluation outlined above gives no indication of the reasons for failure or success. In an online system the investigation of the reasons, or microevaluation, can be carried out as an extension of the macro study. In the first case, where a comparison was possible between the user's search and the experimenter's, the differences between the two searches should be investigated, i.e. the relevant references missed and the reasons for missing them. It is here that the collection of the complete session data is of value. It can then be seen which terms, omitted from the user's search, retrieved relevant references in the broader search and, in a more detailed investigation, how these terms came to be omitted and whether they could have been found through the system.

Many other aspects of the system can be investigated by analysing complete session data collected as part of an evaluation test. For example, the usefulness of commands can be determined by looking at the sequences of commands in a session. One would expect a command showing related terms in a dictionary or thesaurus to be followed by the selection of some, or all, of those terms. The proportion of terms so chosen is a measure of the value of the command. Collecting statistics on both command use and the time taken to execute commands can lead to proposals for improving the system.
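Two of the session-level measures suggested above can be sketched concretely: the number of references seen before the first relevant one, and the proportion of thesaurus-displayed terms that the user subsequently selects. The log format below (a list of command/payload events) is invented for illustration; a real system would record its own transcript.

```python
# Whole-session analysis sketch, assuming a hypothetical event log of
# (command, payload) pairs recorded during an online search session.

def first_relevant_rank(retrieved, relevant):
    """Number of references displayed before the first relevant one,
    or None if no relevant reference was retrieved."""
    for i, doc in enumerate(retrieved):
        if doc in relevant:
            return i
    return None

def thesaurus_command_value(session):
    """Proportion of terms shown by a 'related terms' command that the
    user subsequently selected -- one measure of the command's value."""
    shown, selected = set(), set()
    for command, payload in session:
        if command == "related":      # system displayed related terms
            shown |= set(payload)
        elif command == "select":     # user picked terms for the search
            selected |= set(payload) & shown
    return len(selected) / len(shown) if shown else 0.0

# Hypothetical session: two 'related terms' displays, six terms shown,
# three of them selected.
session = [
    ("related", ["lasers", "masers", "optics"]),
    ("select",  ["lasers", "optics"]),
    ("related", ["holography", "interferometry", "diffraction"]),
    ("select",  ["holography"]),
]
# thesaurus_command_value(session) -> 3/6 = 0.5
```

Aggregated over many sessions, such proportions, together with timings per command, would support the kind of improvement proposals the text describes.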
An attempt has been made to point out some of the aspects of online information retrieval systems that could lend themselves to testing and