Information Retrieval Experiment
Edited by Karen Sparck Jones. Butterworth & Company.

Chapter 9: Laboratory tests: automatic systems
Robert N. Oddy
experimenter to disregard the problems of implementing his algorithms as components of a full-scale operational system, and concentrate his programming effort on achieving efficiency in his test programs, and thus a quick turnaround for each stage of the experiment. One often encounters the view (or faith) that, if the merits of the logic of the retrieval technique warrant use in a large-scale, real-life system, ways will certainly be found to create an acceptable implementation, either through ingenious programming or by devising new hardware. This seems not unreasonable when one considers the history of computing.
I should like to illustrate the nature of information retrieval laboratory
programs with a simple example. Although this should not be regarded as a
description of an actual program, it does draw upon ideas present in existing
programs. Suppose that we wish to model a system that responds to a user's
query with a ranked list of references to documents. The document ranking
is to be determined by the sum of the weights of terms which are common to
both query and document. For my present purpose, it is not important to
specify how the weights have been derived; suffice it to say that they are
available either in the document descriptions or the queries. One evaluation
method, used by Sparck Jones [2], involves aggregating, for all queries in the test, the numbers of relevant and non-relevant documents retrieved at each value of the matching function. (This leads to a type of microevaluation, in Rocchio's terminology [32].)
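For concreteness, the matching function just described can be sketched in a modern programming language. The following Python fragment is my own illustration, not a program from any of the tests discussed; the dictionary and set representations of queries and documents are assumptions made purely for the sketch.

    def matching_value(query_weights, document_terms):
        # Sum the weights of the terms common to query and document.
        # query_weights: term -> integer weight (the weights are assumed,
        # as the text allows, to be carried with the query).
        # document_terms: the set of (numerically coded) terms in a
        # document description.
        return sum(weight for term, weight in query_weights.items()
                   if term in document_terms)

A document sharing no terms with the query scores zero, and can simply be treated as not retrieved.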
An operational system would require the capability of selecting enough of the highest-ranking documents to satisfy the demands of the user, and of presenting them in the correct order: a process which must be designed with
some subtlety if it is not to use excessive computing resources. A laboratory
model could do the task in a very different manner: ranking documents is
unnecessary because there is no user to view them. A straightforward
computer program can be designed along the following lines (a sketch in code follows the list):
(1) Set up a two-column table in core storage to be used for counting relevant
and non-relevant documents (see Figure 9.3).
(2) Place all the relevance judgements in core storage.
(3) Place all the (numerically coded) document descriptions in core storage.
(4) Proceed through the file of queries, performing the following operations
for each query: Compare each document with the query to compute a
matching value (which should be a positive whole number); this and the
relevance relationship between the document and the query determine a
position in the table, to which 1 is added.
(5) When the last query has been processed, the completed table can be used
to calculate figures for a recall/precision plot. Note that this table, and not
retrieved references, is the primary output of the program.
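To make the five steps concrete, here is a minimal sketch of such a program. Python is used for legibility rather than a language of the period; the in-core data structures, the pair-set representation of the relevance judgements, and all of the identifiers are assumptions of mine, and integer term weights keep the matching values whole, as step (4) requires.

    from collections import defaultdict

    def run_test(queries, documents, relevant_pairs):
        # queries:        iterable of (query_id, {term: weight}) pairs
        # documents:      list of (doc_id, term_set) pairs held in core (step 3)
        # relevant_pairs: set of (query_id, doc_id) judgements in core (step 2)

        # Step (1): the two-column table, one row per matching value.
        table = defaultdict(lambda: [0, 0])   # value -> [relevant, non-relevant]

        # Step (4): a single pass through the file of queries.
        for query_id, weights in queries:
            for doc_id, terms in documents:
                value = sum(w for t, w in weights.items() if t in terms)
                if value > 0:                 # matching values are positive integers
                    column = 0 if (query_id, doc_id) in relevant_pairs else 1
                    table[value][column] += 1

        # Step (5): the completed table, not a list of retrieved references,
        # is the program's primary output.
        return table

    def recall_precision_points(table, total_relevant):
        # Cumulate the table downwards from the highest matching value;
        # each threshold yields one (recall, precision) point for the plot.
        points, rel, nonrel = [], 0, 0
        for value in sorted(table, reverse=True):
            rel += table[value][0]
            nonrel += table[value][1]
            points.append((rel / total_relevant, rel / (rel + nonrel)))
        return points

No ranking or sorting of documents occurs anywhere: the table alone carries everything the evaluation needs.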
On most computers, the number of documents that can be handled in this
way is rather limited, because all their descriptions are held in core storage
while queries are processed. However, test collections usually have far fewer
queries than documents: one does not often encounter experiments in the
literature which use more than 250 queries. Thus, a more useful experimental
computer program, which could deal with indefinitely large sets of documents,
can be built by simply substituting 'query' for 'document', and vice versa, in
steps (3), (4) and (5) above. We now have a program which looks structurally
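To show what this substitution amounts to, the inner structure of the sketch given earlier can be transposed (again my own illustration, under the same assumed representations): the queries now form the descriptions held in core, and the document descriptions are read past them sequentially from a file of whatever size.

    from collections import defaultdict

    def run_test_transposed(document_file, queries, relevant_pairs):
        # Steps (3), (4) and (5) with 'query' and 'document' interchanged:
        # the (comparatively few) queries stay in core, while document
        # descriptions are read one at a time from backing store.
        table = defaultdict(lambda: [0, 0])
        queries = list(queries)               # small enough to hold in core
        for doc_id, terms in document_file:   # single sequential pass
            for query_id, weights in queries:
                value = sum(w for t, w in weights.items() if t in terms)
                if value > 0:
                    column = 0 if (query_id, doc_id) in relevant_pairs else 1
                    table[value][column] += 1
        return table                          # the same table as before

The completed table is identical to that produced by the first version; only the order in which the query-document pairs are visited has changed.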