Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Chapter: The Smart environment for retrieval system evaluation - advantages and problem areas
Gerard Salton
Suffice it to say that a large number of fully automatic retrieval techniques
were identified which appeared to be competitive with the conventional
manual indexing procedures and inverted file technologies in standard use.
Large-scale improvements appeared possible by using the iterative relevance
feedback process to reformulate the search requests, and no substantial
deterioration resulted from extending the operations to alien environments
such as foreign language materials.
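The iterative relevance feedback process is commonly described as Rocchio-style query reformulation: the query vector is moved toward the centroid of documents the user judged relevant and away from those judged non-relevant. The sketch below illustrates the general technique only; the weights (alpha, beta, gamma) and the toy term vectors are assumptions for illustration, not values from the Smart experiments.

```python
# Sketch of Rocchio-style relevance feedback for query reformulation.
# The weights and the tiny three-term vectors are illustrative assumptions.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reformulated query vector: move the query toward the
    centroid of the relevant documents and away from the non-relevant ones."""
    dims = len(query)
    new_q = [alpha * q for q in query]
    for doc in relevant:
        for i in range(dims):
            new_q[i] += beta * doc[i] / len(relevant)
    for doc in nonrelevant:
        for i in range(dims):
            new_q[i] -= gamma * doc[i] / len(nonrelevant)
    # Negative term weights are customarily clipped to zero.
    return [max(0.0, w) for w in new_q]

# Toy example: three index terms, one relevant and one non-relevant document.
q = [1.0, 0.0, 0.0]
rel = [[0.0, 1.0, 0.0]]
non = [[0.0, 0.0, 1.0]]
print(rocchio(q, rel, non))  # the query gains weight on the second term
```

Iterating this step over successive search rounds is what produced the large feedback improvements reported for the Smart experiments.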
When the automatic procedures incorporated into the Smart system were
compared with the manual analysis methodologies used by the Medlars
retrieval service at the National Library of Medicine, it was found that the
Smart indexing process based on the use of a stored thesaurus produced
retrieval results approximately equivalent in terms of recall and precision to
those obtainable with Medlars. Using a variety of enhancements such as the
automatic relevance feedback procedure, advantages of about 30 per cent
could be produced for the automatic Smart system compared to the
conventional Medlars process.
Those results turned out to have little immediate impact on operational
information retrieval, largely because of the difficulty of making believable
test results obtained with sample collections of a few hundred documents
when the operational environments include several million items. Additional
problems are posed by the enormous investments already made in the
available commercial systems, which make it impossible to contemplate a
complete retooling of the kind involved in introducing language analysis
methods based on the availability of document abstracts and new file
organization methods.
More fundamental complaints were also voiced about the methodologies
incorporated into the Smart evaluation system. One of these concerned the
necessity to utilize relevance assessments of documents with respect to
queries in order to compute recall and precision values. Large-scale studies
were made of the relevance assessment process, leading to the conclusion that
relevance assessments of given documents with respect to particular queries
were generally unreliable and not extendible to different system users. Hence
it was argued that recall and precision values obtained by averaging the
search results over 40 user queries were valid only for the 40 users whose
relevance judgements were actually involved22,23.
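The recall and precision values at issue are computed directly from the relevance assessments, as in the minimal sketch below; the document identifiers and judgements are invented for illustration.

```python
# Minimal sketch of recall and precision computed from relevance
# assessments.  Document identifiers and judgements are invented.

def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved set against judged-relevant docs."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant)      # fraction of relevant docs found
    precision = hits / len(retrieved)  # fraction of retrieved docs relevant
    return recall, precision

retrieved = ["d1", "d3", "d7", "d9"]  # documents returned by a search
relevant = ["d1", "d2", "d3"]         # assessor-judged relevant documents
r, p = recall_precision(retrieved, relevant)
print(r, p)  # 2 of 3 relevant found; 2 of 4 retrieved are relevant
```

Because the denominator of recall is the set of judged-relevant documents, any change in the assessments changes the computed values, which is precisely why the reliability of the judgements was questioned.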
Eventually it became necessary to perform a complete study of the question
by using a variety of different user populations rendering relevance
assessments for the same document collections with respect to the same user
queries. It then became clear that the recall-precision results could be
expected to remain reasonably invariant with different user populations even
though the individual assessments would differ widely24. It was found that
substantial agreement existed among groups of assessors for documents
retrieved early in a given search that exhibit substantial similarities with the
user queries. Those documents are precisely the ones that largely control the
shape of the recall-precision curve. There is little agreement for items that
are less similar to the queries, and which therefore appear low down on the
output lists; but these documents carry little weight in overall system
performance.
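The argument that early-retrieved documents control the curve can be illustrated numerically. In the sketch below (ranking and judgements invented), two assessors agree on the top-ranked documents but disagree far down the list, and the precision values at the early cutoffs that shape the recall-precision curve are unaffected.

```python
# Illustrative sketch (invented ranking and judgements): assessors who agree
# on top-ranked documents but disagree low in the list produce identical
# precision values at the early rank cutoffs.

def precision_at(ranking, relevant, k):
    """Precision after the first k documents of a ranked output list."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
judge_a = {"d1", "d2", "d3", "d8"}  # judges d8 (rank 8) relevant
judge_b = {"d1", "d2", "d3", "d7"}  # judges d7 (rank 7) instead

for k in (3, 5):
    # identical at the early cutoffs, despite the low-rank disagreement
    print(k, precision_at(ranking, judge_a, k), precision_at(ranking, judge_b, k))
```

Only at cutoffs deep enough to reach the disputed documents do the two sets of judgements diverge, and those points contribute little to overall system comparisons.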
Many other objections can be raised about laboratory tests of retrieval
systems: