IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
5.3 Decision 3: How to operationalize the variables?
A variable is simply some attribute or feature, qualitative or quantitative,
of a retrieval system. In experimental work, variables are normally classified
as independent or input or system variables, on the one hand, and dependent
or output or performance variables, on the other hand. The independent
variables are the ones the experimenter manipulates or controls in order to
determine the effect on the dependent variables: for example, the effect of
indexing depth on indexing speed, or the effect of term linkages on recall and
precision. It must be remembered that there are no a priori independent and
dependent variables. An independent variable in one experiment may serve
as a dependent variable in another. For example, one study might be of the
effect of different indexing language features on speed of indexing, another
on the effect of speed of indexing on the number of indexing errors.
The three previous chapters have discussed many of the independent and
dependent variables of previous information retrieval experiments. This
chapter will be concerned solely with procedures for actually observing or
measuring them, i.e. with operationalizing them. Because of the great variety
of variables encountered in information retrieval (variables characterizing
the document collection and database, the indexing languages, the queries
and search processes, the people associated with a retrieval operation, and
the evaluation of the output), discussion will be restricted to those which
have been previously operationalized. This is not to suggest that these
operationalizations are necessarily the best or only ones. However, the approaches
may be useful to the new information scientist.
What follows is in the form of a listing of the major categories of variables
with some suggestions for operationalization.
Document collection and database
Variables in this category relate to the size, source, form, medium, and broad
subject coverage of the collection and to the type of document representation
or surrogate used in the database. The term database here will refer to the
collection of document surrogates and any associated indexes or access files.
The term is used very generally and includes printed indexes, card catalogues,
microform indexes and catalogues, and tape or disk files, whether accessed
in batch or interactive mode. Since most of these variables are qualitative,
the main problem here is to define appropriate categories. What are the
possible document forms (monograph, journal article, technical report,
patent, etc.), and to what extent can they be applied across the different
media (print, micromedia, film, recordings, machine-readable text)? What
are the possible elements in a database record (author, title, source, index
terms, abstract, full text, etc.)?
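One way to make such category schemes concrete is to write them down as enumerations and a record type. The following is only a minimal sketch: the category lists are taken from the examples in the text, while the names DocumentForm, Medium, DatabaseRecord and form_by_medium, and the OTHER category standing in for "etc.", are assumptions introduced here for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class DocumentForm(Enum):
    MONOGRAPH = "monograph"
    JOURNAL_ARTICLE = "journal article"
    TECHNICAL_REPORT = "technical report"
    PATENT = "patent"
    OTHER = "other"  # assumption: stands in for the "etc." in the text


class Medium(Enum):
    PRINT = "print"
    MICROMEDIA = "micromedia"
    FILM = "film"
    RECORDING = "recording"
    MACHINE_READABLE = "machine-readable text"


@dataclass
class DatabaseRecord:
    """A document surrogate; every element other than form and medium is optional."""
    form: DocumentForm
    medium: Medium
    author: str = ""
    title: str = ""
    source: str = ""
    index_terms: List[str] = field(default_factory=list)
    abstract: str = ""
    full_text: str = ""


def form_by_medium(
    records: Iterable[DatabaseRecord],
) -> Dict[Tuple[DocumentForm, Medium], int]:
    """Count records in each (form, medium) cell of the collection."""
    return dict(Counter((rec.form, rec.medium) for rec in records))
```

A cross-tabulation of this kind shows which document forms actually occur in which media for a given collection, which is one way of checking whether the chosen categories apply across media.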
An important variable here is the heterogeneity of the collection. One
approach, presented by Brookes², is a measure of categorical dispersion. The
documents are assigned to n subject categories. The categories are ranked
and the frequency in each category is determined. If f(r) represents the
frequency of the rth category, the mean rank will be

\[ m = \frac{\sum_{r=1}^{n} r \, f(r)}{\sum_{r=1}^{n} f(r)} \]
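
The mean rank, that is, the sum of r times f(r) over all categories divided by the sum of f(r), can be computed directly from the category frequencies. The sketch below is illustrative only: the function name mean_rank is hypothetical, and it assumes that the categories are ranked in order of decreasing frequency, so that rank 1 is the most populous category. The text says only that the categories are ranked.

```python
from typing import Sequence


def mean_rank(frequencies: Sequence[int]) -> float:
    """Mean rank m = sum(r * f(r)) / sum(f(r)), with ranks r = 1..n."""
    # Assumption: rank 1 goes to the category holding the most documents.
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("the collection contains no documents")
    weighted = sum(r * f for r, f in enumerate(ranked, start=1))
    return weighted / total
```

For example, a collection of 100 documents spread over three categories with frequencies 50, 30 and 20 has mean rank (50 + 60 + 60) / 100 = 1.7. A collection concentrated entirely in one category has mean rank 1, and one spread evenly over four categories has mean rank 2.5.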