Information Retrieval Experiment, edited by Karen Sparck Jones (Butterworth & Company). Chapter: The pragmatics of information retrieval experimentation, by Jean M. Tague.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Decision 4

Collection coverage

The subjects of the documents described by a database, their age, language, scope, medium and source can all, potentially, affect the measures of retrieval performance. Hence, one must either use a collection which is homogeneous with respect to these attributes and then claim results only for this limited sphere, or attempt to randomize the collection with respect to some or all of them.

Some form of randomized selection, even within a narrow boundary, is essential. This eliminates a possibly unconscious bias of the experimenter in selecting the documents. For example, if documents were to be from the computer science field and published during the past three years in English, a random selection from an existing bibliographic database such as Computing Reviews or Computer and Control Abstracts could be used. Tables of random numbers are useful in making the selection, either to select document numbers if items are numbered or to select pages and line numbers within the page if they are not.

Form of surrogate

The form of the document surrogates, whether citations only or citations with index terms, abstracts, full text, etc., should be appropriate to the hypothesis under test.
Also, the form of output presented to a user affects relevance judgements. It is essential, therefore, that all entries in the database be in the same form. Also, if real users, with real information needs, are involved in the experiment, there should be access to the full text of the documents themselves, if only for 'public relations'.

One might not wish, however, to make decisions about record form solely on the basis of present needs. Because of the expense of setting up an experimental database, consideration should be given to future use of it, both by the investigator and by others. If additional fields which have a high probability of being useful in later experiments can be input at very little added cost, it often saves time to include them, particularly if they are short. Also, it is useful to include one or two blank fields in a computer record, which can be assigned later.

Characteristics of the indexing

If documents are indexed using a number of different languages, how will the investigator ensure that parallel index records in different languages cover the same topics? Keen³ has described the use of an intermediate language, in which all topics to be indexed are initially described, for this purpose.

Other aspects of the indexing process which should be controlled are the professional level and experience of the indexers, and the source of the indexing, whether from full text, abstract, or title. It is an obviously biased procedure to use the same personnel for both indexing and searching. In addition, one would prefer to see the chief investigator in a study remain relatively independent of both these operations. However, for research with a small budget, as, for example, much doctoral research, this requirement may simply be impossible to satisfy.

The structure of the database should be appropriate to the type of query