IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Collection coverage
The subjects of the documents described by a database, their age, language,
scope, medium and source can all, potentially, affect the measures of retrieval
performance. Hence, one must either use a collection which is homogeneous
with respect to these attributes and then claim results only for this limited
sphere or attempt to randomize the collection with respect to some or all of
them. Some form of randomized selection, even within a narrow boundary,
is essential. This eliminates a possibly unconscious bias of the experimenter
in selecting the documents. For example, if documents were to be from the
computer science field and published during the past three years in English,
a random selection from an existing bibliographic database such as Computing
Reviews or Computer and Control Abstracts could be used. Tables of random
numbers are useful in making the selection, either to select document
numbers if items are numbered or to select pages and line numbers within the
page if they are not.
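
Where the collection is held in machine-readable form, a pseudo-random number generator can stand in for the printed table. The following is a minimal sketch only, assuming documents are numbered from 1 upward or, failing that, that the source is paginated with a known number of lines per page. The function names, parameters and sizes are illustrative assumptions, not part of any particular system.

```python
import random


def sample_document_numbers(collection_size, sample_size, seed=None):
    """Draw a simple random sample of document numbers, without replacement.

    Assumes documents are numbered 1..collection_size (an assumption).
    """
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(range(1, collection_size + 1), sample_size))


def sample_page_line_positions(num_pages, lines_per_page, sample_size, seed=None):
    """Draw distinct (page, line) positions for an unnumbered, paginated source.

    The entry found at each position would then be taken into the sample.
    """
    if sample_size > num_pages * lines_per_page:
        raise ValueError("sample_size exceeds the number of available positions")
    rng = random.Random(seed)
    positions = set()
    while len(positions) < sample_size:
        page = rng.randint(1, num_pages)        # inclusive bounds
        line = rng.randint(1, lines_per_page)
        positions.add((page, line))             # the set discards repeats
    return sorted(positions)
```

Recording the seed alongside the experimental results allows the same selection to be regenerated later, which serves the same purpose as noting the starting point used in a printed table.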
Form of surrogate
The form of the document surrogates, whether citations only or citations
with index terms, abstracts, full text, etc., should be appropriate to the
hypothesis under test. Also, the form of output presented to a user affects
relevance judgements. It is essential, therefore, that all entries in the database
be in the same form. Also, if real users, with real information needs, are
involved in the experiment, there should be access to the full text of the
documents themselves, if only for `public relations'.
One might not wish, however, to make decisions about record form solely
on the basis of present needs. Because of the expense of setting up an
experimental database, consideration should be given to future use of it, both
by the investigator and by others. If additional fields can be input at very
little added cost, fields which have a high probability of being useful in later
experiments, it often saves time to include them, particularly if short. Also,
it is useful to include one or two blank fields in a computer record, which can
be assigned later.
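
A record layout along these lines might be sketched as follows. This is an illustration under stated assumptions, not a prescribed format: the class name, the field names and the two spare fields are all hypothetical. The helper checks the requirement that all entries in the database be in the same form, here taken to mean that every record fills in the same set of fields.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class SurrogateRecord:
    """One document surrogate; all field names are hypothetical."""
    doc_id: int
    citation: str
    index_terms: List[str] = field(default_factory=list)
    abstract: Optional[str] = None
    # Blank fields reserved for uses not foreseen when the database is built.
    spare_1: Optional[str] = None
    spare_2: Optional[str] = None


def is_empty(value):
    """Treat None, the empty string and the empty list as 'not filled in'."""
    return value is None or value == "" or value == []


def populated_fields(record):
    """Return the names of the fields that this record actually fills in."""
    return frozenset(
        f.name for f in fields(record) if not is_empty(getattr(record, f.name))
    )


def same_form(records):
    """True if every record fills in exactly the same set of fields."""
    return len({populated_fields(r) for r in records}) <= 1
```

Running same_form over the whole database before an experiment begins would flag, for instance, a collection in which some entries carry abstracts and others do not.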
Characteristics of the indexing
If documents are indexed using a number of different languages, how will the
investigator ensure that parallel index records in different languages cover
the same topics? Keen3 has described the use of an intermediate language,
into which all topics to be indexed are initially described, for this purpose.
Other aspects of the indexing process which should be controlled are the
professional level and experience of the indexers, and the source of the
indexing, whether from full text, abstract, or title. It is an obviously biased
procedure to use the same personnel for both indexing and searching. In
addition, one would prefer to see the chief investigator in a study remain
relatively independent of both these operations. However, for research with
a small budget, as, for example, much doctoral research, this requirement
may simply be impossible to satisfy.
The structure of the database should be appropriate to the type of query