Information Retrieval Experiment
Edited by Karen Sparck Jones
Laboratory tests of manual systems
E. Michael Keen
Butterworth & Company
relations) as improvers of precision have a small and minority influence (ISILT, Farradane13).
(9) The index language vocabulary has a minor influence on performance
compared with query negotiation, searching and indexing (Cranfield 1,
Cranfield 2).
(10) A pre-coordinate file requires significantly more search effort and time
to reach a given recall compared with a post-coordinate file (ISILT).
(11) Preservation of entry context allows significant rejection of non-relevant
entries for very little recall loss (ISILT, EPSILON).
(12) Use of direct entry significantly reduces search time and effort: the indirect entry of chain procedure subject headings (as in the British Technology Index) has these penalties, for example (EPSILON).
(13) The varieties of function word provision and term order (e.g. in KWAC,
articulated, PRECIS) perform indistinguishably (EPSILON).
It may be added that operational testing also adds its weight to these findings:
for example, MEDLARS28 bears out number (9), and WUSCS22 bears out
numbers (5), (6), (7), and (13).
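It may help to recall how such findings are quantified. The following sketch (in Python, with invented document identifiers and relevance judgements; nothing here is taken from the tests themselves) computes the standard recall and precision ratios for a single search:

    # Recall and precision ratios for a single search.
    # The documents and relevance judgements are invented purely
    # for illustration.

    def recall(retrieved: set, relevant: set) -> float:
        """Fraction of the relevant documents that the search retrieved."""
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved: set, relevant: set) -> float:
        """Fraction of the retrieved documents that are relevant."""
        return len(retrieved & relevant) / len(retrieved)

    retrieved = {"d1", "d2", "d3", "d4", "d5"}  # output of one search
    relevant = {"d2", "d4", "d7", "d9"}         # judged relevant to the query

    print(f"recall    = {recall(retrieved, relevant):.2f}")     # 0.50
    print(f"precision = {precision(retrieved, relevant):.2f}")  # 0.40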
Measuring information retrieval system characteristics
Conclusions and findings about information retrieval cannot be generally
utilized unless measured relationships can be established between the
variables studied and performance. For example, the best choice of indexing
and index language as to term specificity (where users want a good precision ratio) needs a generalizable measure of specificity to replace the emotive 'named' index languages that usually figure in tests. A suitable measure has
proved hard to find: indexing exhaustivity is a little easier, with Cranfield 2
testing five levels and showing that 33 terms per document was the best in
that test environment4. For specificity in Cranfield 2 the first crude measure
was that of vocabulary size4, with large sizes taken to be more specific, but
ignoring the influence of term use in indexing and searching that might well
overlay the effect of size. A somewhat better measure was devised for the
later ISILT test, where measures of specificity were related to the outcome of
usage of the terms in indexing and searching, namely, measures based on size
of retrieval output. But ISILT, with only three comparable index languages, could hardly reveal an interpolated optimum, so this approach was
reapplied to the Cranfield 2 data on 29 index languages. Figure 8.4 gives the
resulting plot of specificity versus precision (taken from Keen and Digger9).
The connecting lines represent logical links between the different index
languages: they are directions in which performance could be altered by
varying the specificity of indexing or searching. Overall optimum specificity
is that of language 13, single term word stems. Within the concept (phrase)
languages there is a fall in precision either side of language 1112, simple concepts with
complete species from hierarchy. This measure of specificity is not the last
word on the matter, and still better measures need to be devised.
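The exact output-size formula used in ISILT and in the Cranfield 2 reanalysis is not reproduced here; as a hedged sketch of the general idea only, the following Python fragment scores each term by the inverse of its retrieval output size (a term that retrieves few documents counts as highly specific) and averages the scores over an index language's vocabulary. The inverse scoring rule, the averaging, and the toy postings are all assumptions made for illustration:

    # Illustrative only: specificity measured from retrieval output size.
    # A term that retrieves few documents is scored as highly specific;
    # the mean over the vocabulary summarizes one index language.
    # Neither the scoring rule nor the postings are taken from ISILT
    # or Cranfield 2.

    def term_specificity(index):
        """index maps each term to the set of documents it retrieves."""
        return {term: 1.0 / len(docs) for term, docs in index.items() if docs}

    def language_specificity(index):
        """Mean term specificity: one crude summary for a whole language."""
        scores = term_specificity(index)
        return sum(scores.values()) / len(scores)

    broad = {"science": {"d1", "d2", "d3", "d4"},
             "method": {"d1", "d3", "d4"}}
    narrow = {"aerofoil": {"d2"},
              "fatigue testing": {"d3", "d4"}}

    print(f"broad terms:  {language_specificity(broad):.2f}")   # 0.29
    print(f"narrow terms: {language_specificity(narrow):.2f}")  # 0.75

On such a measure a vocabulary of narrow terms scores higher than one of broad terms, which is the direction of effect the text describes, though any serious measure would also have to reflect how the terms are actually used in indexing and searching.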
Measurement of cross-reference provision (linkage) was plotted against
performance in ISILT. Search breadth also needs measuring beyond the
crude use of co-ordination levels. The development of reliable and generally applicable measures of system characteristics would remove the need to test