Information Retrieval Experiment, edited by Karen Sparck Jones
Chapter: The pragmatics of information retrieval experimentation, by Jean M. Tague
Butterworth & Company

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

5.3 Decision 3: How to operationalize the variables?

A variable is simply some attribute or feature, qualitative or quantitative, of a retrieval system. In experimental work, variables are normally classified as independent or input or system variables, on the one hand, and dependent or output or performance variables, on the other hand. The independent variables are the ones the experimenter manipulates or controls in order to determine the effect on the dependent variables: for example, the effect of indexing depth on indexing speed, or the effect of term linkages on recall and precision.

It must be remembered that there are no a priori independent and dependent variables. An independent variable in one experiment may serve as a dependent variable in another. For example, one study might be of the effect of different indexing language features on speed of indexing, another on the effect of speed of indexing on the number of indexing errors.

The three previous chapters have discussed many of the independent and dependent variables of previous information retrieval experiments. This chapter will be concerned solely with procedures for actually observing or measuring them, i.e. with operationalizing them.
Because of the great variety of variables encountered in information retrieval (variables characterizing the document collection and database, the indexing languages, the queries and search processes, the people associated with a retrieval operation, and the evaluation of the output), discussion will be restricted to those which have been previously operationalized. This is not to suggest that these operationalizations are necessarily the best or only ones. However, the approaches may be useful to the new information scientist. What follows is in the form of a listing of the major categories of variables with some suggestions for operationalization.

Document collection and database

Variables in this category relate to the size, source, form, medium, and broad subject coverage of the collection and to the type of document representation or surrogate used in the database. The term database here will refer to the collection of document surrogates and any associated indexes or access files. The term is used very generally and includes printed indexes, card catalogues, microform indexes and catalogues, and tape or disk files, whether accessed in batch or interactive mode.

Since most of these variables are qualitative, the main problem here is to define appropriate categories. What are the possible document forms (monograph, journal article, technical report, patent, etc.), and to what extent can they be applied across the different media (print, micromedia, film, recordings, machine-readable text)? What are the possible elements in a database record (author, title, source, index terms, abstract, full text, etc.)?

An important variable here is the heterogeneity of the collection. One approach, presented by Brookes², is a measure of categorical dispersion. The documents are assigned to n subject categories. The categories are ranked and the frequency in each category is determined.
If f(r) represents the frequency of the rth category, the mean rank will be

\[ m = \frac{\sum_{r=1}^{n} r\, f(r)}{\sum_{r=1}^{n} f(r)} \]
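
The mean-rank calculation can be illustrated with a minimal Python sketch. It is an illustration only, not Brookes's own procedure: the function name `mean_rank` and the sample subject labels are hypothetical, and the sketch assumes that rank 1 is given to the most frequent category, since the text says only that the categories are ranked.

```python
from collections import Counter


def mean_rank(categories):
    """Return m = sum(r * f(r)) / sum(f(r)) for ranked category frequencies.

    `categories` holds one subject label per document.
    """
    counts = Counter(categories)
    if not counts:
        raise ValueError("at least one document is required")

    # Assumption: categories are ranked by decreasing frequency,
    # so the most frequent category receives rank r = 1.
    freqs = sorted(counts.values(), reverse=True)

    total = sum(freqs)  # sum of f(r), i.e. the number of documents
    weighted = sum(r * f for r, f in enumerate(freqs, start=1))  # sum of r * f(r)
    return weighted / total


# Hypothetical collection: 5, 3 and 2 documents in three subject categories.
docs = ["physics"] * 5 + ["chemistry"] * 3 + ["biology"] * 2
m = mean_rank(docs)  # (1*5 + 2*3 + 3*2) / 10
print(m)
```

Under this ranking assumption, a collection concentrated in a single category gives a mean rank of 1, and the value rises as documents spread more evenly over more categories.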