Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Chapter: Opportunities for testing with online systems
Elizabeth D. Barraclough
The comprehensive search should be assessed by measuring recall and
precision. For this type of search in a real-life system one would be attempting
to assess the actual performance of the system in its normal operating mode
against the capabilities of the system as exploited by the experienced
experimenter. The user, or his intermediary, would perform the search in the
normal way using whatever interactive tools he wished, e.g. iterating through
a series of formulations and prints to get a satisfactory search, finally ending
up with a formulation which would be deemed to be a correct expression of
his needs. The experimenter would provide a broad search on this topic
which would include the user's formulation as a subset and the user would be
asked to assess the full output. From this the precision of the user's
formulation can be established, and an estimate of the recall obtained from
the overlap between the full set and the user's subset.
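The overlap estimate described above can be sketched as follows. This is an illustrative reconstruction, not from the chapter: the function name, document identifiers, and set representation are all assumptions.

```python
# Sketch: estimating precision and recall when the experimenter's broad
# search includes the user's final formulation as a subset.
# All identifiers below are hypothetical.

def precision_and_recall_estimate(user_hits, broad_hits, relevant):
    """user_hits: documents retrieved by the user's final formulation.
    broad_hits: documents retrieved by the experimenter's broader search
    (a superset of user_hits).
    relevant: documents the user judged relevant in the full broad output."""
    user_hits, broad_hits, relevant = map(set, (user_hits, broad_hits, relevant))
    # Precision of the user's formulation: relevant fraction of its own output.
    precision = len(user_hits & relevant) / len(user_hits)
    # Recall estimate: relevant items the user found, as a fraction of all
    # relevant items the broader search found. This is only an estimate of
    # true recall, since even the broad search may miss relevant documents.
    recall = len(user_hits & relevant) / len(broad_hits & relevant)
    return precision, recall

p, r = precision_and_recall_estimate(
    user_hits={1, 2, 3, 4},
    broad_hits={1, 2, 3, 4, 5, 6, 7, 8},
    relevant={2, 3, 6, 7},
)
# p = 0.5 (2 of the user's 4 hits are relevant)
# r = 0.5 (the user found 2 of the 4 relevant documents in the broad set)
```

Note that the recall figure is relative to the broad search, so it is best read as an upper bound on the user's true recall.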
The user who wishes to find a few references to provide an entry into a
subject has very different criteria for a satisfactory search. Precision is still of
interest, but recall assumes a minor role; the most important aspect is finding
the references quickly and easily. For this type of search all the references
would be retrieved online, so a parallel search by the experimenter is not a
possibility. In this case analysis of the whole session data would prove very
valuable. One might wish to measure the number of references retrieved
before the first relevant one, and also the number of formulations needed,
with the number of relevant citations for each, before a satisfactory
formulation is reached.
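The session measures just listed can be derived mechanically from complete session data. A minimal sketch, assuming a hypothetical log format of (formulation id, document id, relevance judgement) events in display order:

```python
# Sketch (hypothetical log format): deriving session measures from
# complete session data.
from itertools import groupby

session = [
    (1, "A", False), (1, "B", False), (1, "C", True),
    (2, "D", True), (2, "E", True),
    (3, "F", True), (3, "G", True), (3, "H", True),
]

# Number of references retrieved before the first relevant one.
before_first_relevant = next(
    i for i, (_, _, rel) in enumerate(session) if rel
)

# Number of formulations tried, with relevant citations per formulation
# (events are grouped by the formulation that produced them).
relevant_per_formulation = {
    fid: sum(rel for _, _, rel in events)
    for fid, events in groupby(session, key=lambda e: e[0])
}
formulations_needed = len(relevant_per_formulation)
# before_first_relevant = 2; formulations_needed = 3
```

Real transaction logs would also carry timestamps, allowing the "quickly and easily" criterion to be measured directly.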
A good example of the detailed analysis that needs to be done is the study
by de Jong-Hoffman³ of a single search on the INSPEC database. It provides
a lot of ideas for points that should be investigated in a wider study.
System effectiveness
The macroevaluation outlined above gives no indication of the reasons for
failure or success. In an online system the investigation of the reasons, or
microevaluation, can be carried out as an extension of the macro study. In
the first case, where a comparison was possible between the user's search and
the experimenter's, the differences between the two searches should be
investigated, i.e. the relevant references missed and the reasons for this. It is
here that the collection of complete session data is of value. It can then be
seen which omitted terms retrieved relevant references in the broader search
and, in a more detailed investigation, why these terms were omitted from the
search and whether they could have been found through the system.
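This microevaluation step can also be sketched in code. The data layout below (a postings map from terms to retrieved documents) is an assumption for illustration, not something the chapter specifies:

```python
# Sketch (hypothetical data): for each term used only in the experimenter's
# broader formulation, list the relevant references it retrieved that the
# user's formulation missed.

def missed_term_analysis(user_terms, broad_terms, postings, relevant, user_hits):
    """postings maps each term to the set of documents it retrieves;
    relevant and user_hits are sets of document ids."""
    missed_relevant = relevant - user_hits
    report = {}
    for term in broad_terms - user_terms:
        recovered = postings.get(term, set()) & missed_relevant
        if recovered:
            # This omitted term accounts for these missed relevant references.
            report[term] = recovered
    return report

report = missed_term_analysis(
    user_terms={"retrieval"},
    broad_terms={"retrieval", "online", "thesaurus"},
    postings={"retrieval": {1, 2}, "online": {2, 3}, "thesaurus": {3, 4}},
    relevant={2, 3, 4},
    user_hits={1, 2},
)
# report = {"online": {3}, "thesaurus": {3, 4}}
```

The second, qualitative question (why the terms were omitted, and whether the system's own aids would have surfaced them) still requires inspection of the session transcript.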
Many other aspects of the system can be investigated by analysis of
complete session data collected as part of an evaluation test. For example,
the usefulness of commands can be determined by looking at the sequences
of commands in a session. One would expect a command showing related
terms in a dictionary or thesaurus to be followed by the selection of some, or
all, of those terms. The proportion of terms so chosen is a measure of the
value of the command. Collection of statistics of both command use and the
time taken for the execution of the commands can lead to proposals for the
improvement of the system.
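The thesaurus-command measure described above can be computed directly from a session log. A minimal sketch, assuming a hypothetical event format and command names:

```python
# Sketch (hypothetical log format): measuring the value of a related-terms
# command by the proportion of displayed terms the user subsequently selects.

def related_terms_uptake(events):
    """events: chronological (command, terms) pairs, e.g.
    ("RELATED", {terms shown}) or ("SELECT", {terms chosen})."""
    shown, chosen = set(), set()
    for command, terms in events:
        if command == "RELATED":
            shown |= terms
        elif command == "SELECT":
            # Count only selections of terms the command actually surfaced.
            chosen |= terms & shown
    return len(chosen) / len(shown) if shown else 0.0

log = [
    ("RELATED", {"indexing", "thesauri", "classification"}),
    ("SELECT", {"indexing", "thesauri"}),
]
# uptake = 2/3: two of the three displayed terms were taken up
```

Paired with per-command execution times, the same log would support the proposals for system improvement mentioned above.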
An attempt has been made to point out some of the aspects of online
information retrieval systems that could lend themselves to testing and