IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
Smart tests. It is worth noticing that the statistical techniques have been
increasingly seen as means for improving any natural language input, rather
than as tools for totally automatic as opposed to manual indexing and
searching. Thus the motivation for weighting experiments has often been to
show that simple natural language keyword indexing, regarded as itself
having been shown to be competitive with controlled language indexing, can
be improved by the application of devices using statistical information.
Throughout the assumption has been that statistical techniques pick up or
effectively exploit information neglected or inadequately handled by the
human indexer or searcher.
The form of these experiments is well illustrated by the Smart ones. They were
generally characterized by small request and document sets (very often the
test collections of earlier projects like Cranfield 2), and in evaluation by the
use of recall and precision graphs for their ordered search output. The fact
that recall/precision graphs may be obtained by different techniques must be
emphasized, as this may explain large apparent differences between projects.
Van Rijsbergen has utilized a measure of effectiveness combining recall and
precision. Cagan used logical rather than real recall and precision. Evans'
test was of a conventional kind using boolean profiles, and rather larger
document sets than the studies of fully-automatic methods.
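
As a rough illustration of the evaluation style described above, the following minimal sketch derives recall/precision points from one ordered search output and combines a single point into an effectiveness value. The function names, the sample document identifiers, and the use of the form E = 1 - (1 + b^2)PR / (b^2 P + R) are assumptions made for illustration; this is only one of the several ways in which such figures may be obtained, and it is not presented as the procedure used in any of the tests discussed.

```python
from typing import List, Set, Tuple


def recall_precision_points(ranking: List[str],
                            relevant: Set[str]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each position of a ranked output."""
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        # Recall: share of relevant documents retrieved so far.
        # Precision: share of retrieved documents that are relevant.
        points.append((hits / len(relevant), hits / rank))
    return points


def e_measure(recall: float, precision: float, beta: float = 1.0) -> float:
    """Combine recall and precision into one value; lower is better.

    Assumed form: E = 1 - (1 + beta^2) P R / (beta^2 P + R).
    """
    if recall == 0.0 or precision == 0.0:
        return 1.0
    b2 = beta * beta
    return 1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ranked output and relevance judgements for one request.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

points = recall_precision_points(ranking, relevant)
recall_at_3, precision_at_3 = points[2]
e_at_3 = e_measure(recall_at_3, precision_at_3)
```

Plotting the points for each request, and then averaging over requests, is where the techniques diverge (for example, averaging at fixed rank cutoffs as against fixed recall levels), which is one reason graphs from different projects are not directly comparable.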
Overall, the results of the different tests have been very similar. The
detailed findings show considerable variation in performance for the different
options tested in the more extensive comparisons like those made by Vaswani
and Cameron, Sparck Jones, and the Smart Project. Unfortunately, the fact
that the recall/precision graphs produced were obtained by different
techniques means that specific comparisons between projects are impossible,
and even comparisons between the relative ranges of performance have to be
made with caution. Moreover, Sparck Jones and Salton each conducted such
large series of tests that it is very difficult to describe them briefly. We may
therefore simply note that for vocabulary selection and weighting both Salton
and Sparck Jones found performance differences with recall/precision graphs
ranging from 5 to 20 per cent, which were usually also improvements over
simple term matching graphs. For term classification Vaswani and Cameron,
using cutoff ranked output, found classification methods ranging from 16.4
to 21.3 per cent precision with 37.3[OCRerr]9.1 per cent relative recall, compared
with 21.4 and 49.1 per cent for keywords alone; the Smart Project's tests with
classes and phrases showed small performance graph differences and
improvements; Sparck Jones found variations of as much as several hundred per cent
between best and worst classification graphs. Evans found that automatic
assignment with a well-organized thesaurus could provide quite competitive
performance. For document clustering, Van Rijsbergen found small
improvements for clusters at the best end of the performance range, representing
gains in precision but loss of recall; these were rather better results than those
obtained by Smart. A noticeable feature of the tests is that among the better
performance options, similar performance may be obtained by a variety of
approaches. Collectively the authors concerned interpreted their findings as
showing that vocabulary weighting is more effective than selection or
clustering. Selection may impact recall (a point concealed in the Smart type
of performance representation), and is no better than weighting, and both
Smart workers and Sparck Jones have found inverse document frequency