Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
© Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
The decade 1968-1978
(i.e. collection frequency) weighting of some utility. However, neither
Vaswani and Cameron nor Sparck Jones, in substantial series of experiments,
could obtain real performance improvements with clustered rather than
unclustered terms. Cagan found clustering useful, but in a highly eccentric
test. In document clustering, precision can be maintained, but recall suffers.
Taken together, the tests would imply that simple statistical techniques are
as good as more elaborate ones, but even then yield only modest performance
improvements. The main inference to be drawn from the tests was that
vocabulary distribution properties may be important for retrieval: for
example Sparck Jones found clustering rare terms far more useful than
clustering common ones. This observation contributed to the work on
weighting.
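The collection frequency weight itself is not spelled out in this chapter, but its core idea — that terms occurring in few documents deserve more weight than terms occurring in many — can be sketched as follows (the function and variable names are illustrative, not taken from the original papers):

```python
import math

def collection_frequency_weight(n_k: int, big_n: int) -> float:
    """Inverse-document-frequency style weight: rarer terms score higher.

    n_k   : number of documents in which term k occurs
    big_n : total number of documents in the collection
    """
    return math.log2(big_n / n_k)

# A term occurring in 10 of 10 000 documents outweighs one
# occurring in 5 000 of 10 000.
rare = collection_frequency_weight(10, 10_000)
common = collection_frequency_weight(5_000, 10_000)
```

The logarithm keeps the weights on a manageable scale, so that a term a hundred times rarer than another is favoured, but not by a factor of a hundred.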
Automatic methods using relevance information
The final group of tests to be considered are those concerned with relevance
feedback and weighting. The automatic indexing methods discussed so far
are based on information about the occurrences and co-occurrences of terms
in any documents. The use of more specific information about term
occurrences and co-occurrences in relevant documents leads to the relevance
feedback and relevance weighting schemes which have been especially
important in the research work of the decade. The Smart Project's early
tests4, 63, 64, especially those of Ide109, concentrated on feedback methods
for adding terms to, or removing them from, queries; later experiments65, 66
were concerned with relevance or 'precision' weighting. Sparck Jones has
carried out a series of experiments with relevance weights75-77, 95, as have
Harper and van Rijsbergen81. In an operational context, Miller68-70 and the
UKCIS staff (Barker et al.54, 71 and subsequently Robson and Longman72, 73)
have studied relevance-controlled query expansion or weighting schemes.
Cameron's approach was rather different, clustering documents using
relevance information74.
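Ide's feedback methods for adding terms to and removing terms from queries are only named here; the widely published 'dec hi' variant — add every judged-relevant document vector to the query, subtract the top-ranked non-relevant one — can be sketched as follows (the Counter-based term-vector representation is an illustrative choice, not the Smart system's own):

```python
from collections import Counter

def ide_dec_hi(query: Counter, relevant: list,
               top_nonrelevant: Counter) -> Counter:
    """Ide 'dec hi' feedback: add all judged-relevant document
    vectors to the query, subtract the top-ranked non-relevant one."""
    new_q = Counter(query)
    for doc in relevant:
        new_q.update(doc)          # add relevant-document terms
    new_q.subtract(top_nonrelevant)  # penalize the top non-relevant doc
    # Drop terms whose weight has fallen to zero or below.
    return Counter({t: w for t, w in new_q.items() if w > 0})

# Toy example: one relevant document adds 'feedback' to the query;
# the top non-relevant document's terms are pushed out of it.
expanded = ide_dec_hi(Counter({"retrieval": 1}),
                      [Counter({"retrieval": 1, "feedback": 2})],
                      Counter({"noise": 3}))
```

Subtracting only the single highest-ranked non-relevant document, rather than all of them, is what distinguishes 'dec hi' from the plainer 'regular' variant.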
The character of these experiments has been very like that of other
automatic indexing tests. Thus the focus of the tests has been a comparative
evaluation of searching with and without relevance information. The context
has nearly always been that of natural language indexing, though Miller
applied relevance weights to MeSH terms. The motivation has been to
demonstrate the value of statistical methods of indexing utilizing relevance
information. An additional motivation in laboratory tests like those of
Robertson and Sparck Jones75 or Harper and van Rijsbergen has been to
validate a formal theory; while in the service studies like the UKCIS ones,
the statistical feedback techniques have been seen as devices assisting the
user in reducing the effort of profile preparation.
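The relevance weights used in the Robertson and Sparck Jones experiments are not reproduced in this chapter; one published form of the weight, contrasting a term's occurrence rate in the relevant documents with its rate in the rest of the collection, can be sketched as follows (with illustrative names, and the 0.5 correction of one published variant):

```python
import math

def relevance_weight(r: int, n: int, R: int, N: int) -> float:
    """Robertson-Sparck Jones style relevance weight.

    r : relevant documents containing the term
    n : documents containing the term
    R : known relevant documents for the query
    N : documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term concentrated in the relevant set gets a large positive weight;
# a term occurring only outside it gets a negative one.
w = relevance_weight(r=8, n=20, R=10, N=1000)
```

The 0.5 terms keep the weight finite when a term occurs in all, or none, of the known relevant documents — exactly the small-sample situation the feedback setting produces.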
In form the experiments follow the mainstream pattern with recall and
precision evaluation, but with output ordering in the laboratory tests where
the operational tests have concentrated on comparisons with Boolean
performance. It is of interest that in this group of experiments quite large test
collections were used, not only in operational tests like Miller's, but in some
laboratory tests: thus Cameron used some 12 000 documents (though few
requests), and Sparck Jones over 27 000 documents.