Laboratory tests of manual systems
E. Michael Keen

relations) as improvers of precision have a small and minority influence (ISILT, Farradane13).

(9) The index language vocabulary has a minor influence on performance compared with query negotiation, searching and indexing (Cranfield 1, Cranfield 2).

(10) A pre-coordinate file requires significantly more search effort and time to reach a given recall than a post-coordinate file (ISILT).

(11) Preservation of entry context allows significant rejection of non-relevant entries for very little recall loss (ISILT, EPSILON).

(12) Use of direct entry significantly reduces search time and effort: the indirect entry of chain procedure subject headings (as in the British Technology Index), for example, carries these penalties (EPSILON).

(13) The varieties of function word provision and term order (e.g. in KWAC, articulated and PRECIS indexes) perform indistinguishably (EPSILON).

It may be added that operational testing lends its weight to these findings: for example, MEDLARS28 bears out number (9), and WUSCS22 bears out numbers (5), (6), (7) and (13).

Measuring information retrieval system characteristics

Conclusions and findings about information retrieval cannot be generally utilized unless measured relationships can be established between the variables studied and performance. For example, the best choice of indexing and index language as to term specificity (where users want a good precision ratio) needs a generalizable measure of specificity to replace the emotive 'named' index languages that usually figure in tests. A suitable measure has proved hard to find. Indexing exhaustivity is a little easier: Cranfield 2 tested five levels and showed that 33 terms per document was the best in that test environment4.

For specificity in Cranfield 2 the first crude measure was vocabulary size4, with large vocabularies taken to be more specific; but this ignores the influence of term use in indexing and searching, which might well overlay the effect of size. A somewhat better measure was devised for the later ISILT test, where measures of specificity were related to the outcome of using the terms in indexing and searching, namely measures based on the size of retrieval output. But ISILT, with only three comparable index languages, could hardly reveal an interpolated optimum, so this approach was reapplied to the Cranfield 2 data on 29 index languages. Figure 8.4 gives the resulting plot of specificity versus precision (taken from Keen and Digger9). The connecting lines represent logical links between the different index languages: they are directions in which performance could be altered by varying the specificity of indexing or searching. The overall optimum specificity is that of language 13, single term word stems. Within the concept (phrase) languages there is a fall in precision either side of 1112, simple concepts with complete species from hierarchy.
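To make the two measures concrete, here is a minimal sketch over a toy inverted index. All names and data are invented for the example, and the output-size formula is only one plausible reading of the ISILT-style measure, not its published definition: exhaustivity is taken as the mean number of terms assigned per document, and specificity as the reciprocal of the mean retrieval output per term, so that terms retrieving fewer documents count as more specific.

    from collections import defaultdict

    def exhaustivity_and_specificity(index):
        # index: maps each document id to the set of index terms assigned to it.
        postings = defaultdict(set)          # term -> documents it retrieves
        for doc, terms in index.items():
            for term in terms:
                postings[term].add(doc)
        # Exhaustivity: mean number of terms assigned per document
        # (Cranfield 2 found about 33 terms per document best in its environment).
        exhaustivity = sum(len(t) for t in index.values()) / len(index)
        # Specificity proxy: reciprocal of mean retrieval output per term,
        # so a term that retrieves fewer documents counts as more specific.
        mean_output = sum(len(d) for d in postings.values()) / len(postings)
        return exhaustivity, 1.0 / mean_output

    # Toy collection: three documents indexed at different depths.
    index = {
        'd1': {'hypersonic', 'flow', 'wing'},
        'd2': {'flow', 'boundary', 'layer', 'wing'},
        'd3': {'hypersonic', 'boundary', 'layer'},
    }
    print(exhaustivity_and_specificity(index))   # (3.33..., 0.5)

Unlike raw vocabulary size, a measure of this kind reflects how the terms are actually used in indexing, which is the point of the ISILT approach described above.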
This measure of specificity is not the last word on the matter, and still better measures need to be devised. Measurement of cross-reference provision (linkage) was also plotted against performance in ISILT. Search breadth likewise needs measuring beyond the crude use of co-ordination levels (a toy illustration of such levels follows below). The development of reliable and generally applicable measures of system characteristics would remove the need to test
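As a sketch of what 'search breadth via co-ordination levels' means in a post-coordinate file (the function names, toy data and relevance judgements below are all invented for the example), a document is retrieved if it matches at least `level` of the query terms, and lowering the level broadens the search; the printout shows the crude recall-for-precision trade that motivates the call for a finer measure.

    def search_at_level(query, index, level):
        # Post-coordinate search: retrieve a document if it matches at
        # least `level` query terms (full AND when level == len(query)).
        return {doc for doc, terms in index.items()
                if len(terms & query) >= level}

    def precision_recall(retrieved, relevant):
        if not retrieved:
            return 0.0, 0.0
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    index = {
        'd1': {'hypersonic', 'flow', 'wing'},
        'd2': {'flow', 'boundary', 'layer', 'wing'},
        'd3': {'hypersonic', 'boundary', 'layer'},
    }
    query = {'hypersonic', 'boundary', 'layer'}
    relevant = {'d2', 'd3'}                 # assumed relevance judgements
    for level in range(len(query), 0, -1):  # broaden the search step by step
        out = search_at_level(query, index, level)
        p, r = precision_recall(out, relevant)
        print(level, sorted(out), p, r)

Here dropping from level 3 to level 2 gains recall at no cost, but dropping to level 1 only dilutes precision: the single co-ordination number says nothing about where that turning point lies, which is why search breadth needs a measure of its own.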