IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.
Smart tests. It is worth noticing that the statistical techniques have been increasingly seen as means for improving any natural language input, rather than as tools for totally automatic as opposed to manual indexing and searching. Thus the motivation for weighting experiments has often been to show that simple natural language keyword indexing, regarded as itself having been shown to be competitive with controlled language indexing, can be improved by the application of devices using statistical information. Throughout, the assumption has been that statistical techniques pick up or effectively exploit information neglected or inadequately handled by the human indexer or searcher. The form of these experiments is well illustrated by the Smart ones. They were generally characterized by small request and document sets (very often the test collections of earlier projects like Cranfield 2), and in evaluation by the use of recall and precision graphs for their ordered search output. The fact that recall/precision graphs may be obtained by different techniques must be emphasized, as this may explain large apparent differences between projects. Van Rijsbergen has utilized a measure of effectiveness combining recall and precision. Cagan used logical rather than real recall and precision.
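As a minimal sketch of how recall and precision points can be read off an ordered search output, and how the two values can be combined into a single figure, consider the Python below. The combined measure shown is the E measure as it is usually stated for van Rijsbergen's work; the text above does not give a formula, so its use here is an assumption. The function names, document identifiers and relevance judgements are all hypothetical.

```python
def precision_recall_points(ranking, relevant):
    """Return a (recall, precision) pair after each rank position of one ranked output."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        # recall = relevant retrieved / all relevant
        # precision = relevant retrieved / all retrieved so far
        points.append((hits / len(relevant), hits / rank))
    return points


def e_measure(precision, recall, beta=1.0):
    """E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R); lower values are better.

    Assumed form of the combined measure; beta sets the relative weight of recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 1.0  # nothing relevant retrieved: worst possible value
    b2 = beta * beta
    return 1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ranked output and relevance judgements for a single request.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

points = precision_recall_points(ranking, relevant)
recall_at_3, precision_at_3 = points[2]
combined = e_measure(precision_at_3, recall_at_3)
```

Plotting `points` for one request gives a single recall/precision curve; how such curves are averaged over requests is a separate choice that the sketch leaves open.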
Evans' test was of a conventional kind using boolean profiles, and rather larger document sets than the studies of fully-automatic methods. Overall, the results of the different tests have been very similar. The detailed findings show considerable variation in performance for the different options tested in the more extensive comparisons like those made by Vaswani and Cameron, Sparck Jones, and the Smart Project. Unfortunately, the fact that the recall/precision graphs produced were obtained by different techniques means that specific comparisons between projects are impossible, and even comparisons between the relative ranges of performance have to be made with caution. Moreover, Sparck Jones and Salton each conducted such large series of tests that it is very difficult to describe them briefly. We may therefore simply note that for vocabulary selection and weighting both Salton and Sparck Jones found performance differences with recall/precision graphs ranging from 5 to 20 per cent, which were usually also improvements over simple term matching graphs. For term classification Vaswani and Cameron, using cutoff ranked output, found classification methods ranging from 16.4 to 21.3 per cent precision with 37.3[OCRerr]9.1 per cent relative recall, compared with 21.4 and 49.1 per cent for keywords alone; the Smart Project's tests with classes and phrases show small performance graph differences and improvements; Sparck Jones found variations of as much as several hundred per cent between best and worst classification graphs. Evans found that automatic assignment with a well-organized thesaurus could provide quite competitive performance. Van Rijsbergen for document clustering found small improvements for clusters at the best end of the performance range, representing gains in precision but loss of recall; these were rather better results than those obtained by Smart.
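The caution about graphs obtained by different techniques can be illustrated with a small sketch. The text does not say which techniques each project used, so the two averaging methods below (averaging per-request values versus pooling counts over all requests) are only an assumed example of how the same search output can yield different figures. All names and data are hypothetical.

```python
def counts_at_cutoff(ranking, relevant, cutoff):
    """Return (relevant retrieved, retrieved, relevant) for one request at a rank cutoff."""
    retrieved = ranking[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits, len(retrieved), len(relevant)


def macro_average(requests, cutoff):
    """Compute precision and recall per request, then average the values."""
    precisions, recalls = [], []
    for ranking, relevant in requests:
        hits, n_retrieved, n_relevant = counts_at_cutoff(ranking, relevant, cutoff)
        precisions.append(hits / n_retrieved)
        recalls.append(hits / n_relevant)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


def micro_average(requests, cutoff):
    """Pool the counts over all requests, then compute one precision and one recall."""
    total_hits = total_retrieved = total_relevant = 0
    for ranking, relevant in requests:
        hits, n_retrieved, n_relevant = counts_at_cutoff(ranking, relevant, cutoff)
        total_hits += hits
        total_retrieved += n_retrieved
        total_relevant += n_relevant
    return total_hits / total_retrieved, total_hits / total_relevant


# Two hypothetical requests: (ranked output, set of relevant documents).
requests = [
    (["a", "b", "c", "d"], {"a"}),
    (["e", "f", "g", "h"], {"e", "f", "g", "x", "y", "z"}),
]

macro_precision, macro_recall = macro_average(requests, cutoff=2)
micro_precision, micro_recall = micro_average(requests, cutoff=2)
```

With this data the two methods agree on precision but not on recall, because the request with a single relevant document counts as much as the other request under per-request averaging and far less under pooling. A difference of this kind arises from the evaluation method alone, not from the retrieval method under test.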
A noticeable feature of the tests is that among the better performance options, similar performance may be obtained by a variety of approaches. Collectively the authors concerned interpreted their findings as showing that vocabulary weighting is more effective than selection or clustering. Selection may impact recall (a point concealed in the Smart type of performance representation), and is no better than weighting, and both Smart workers and Sparck Jones have found inverse document frequency