IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
and a measure of dispersion which varies between 0 and 1 is
\[ D = \frac{2(m-1)}{n-1} \]
Brookes has modified this measure to take into account the relative sizes in
the total population.
Indexing languages and indexing procedures
Many investigators have contributed to the operationalization of indexing
variables. Below are listed some of the more generally accepted definitions.
(1) Exhaustivity of indexing, i.e. the number of topics covered by the
indexing. Operational definition: number of index terms/document.
Keen and Wheatley3 suggest redundant indexing, for example synonyms
and morphological variants, be eliminated.
(2) Specificity of indexing, i.e. the preciseness of the subject description.
Operational definition: number of postings per term. This value,
however, may depend as much on fashions in the literature as on the
specificity of a term.
(3) Degree of linkage in a vocabulary. Operational definition: number of
references in the dictionary or thesaurus. Keen suggests that only "see also"
references, not "see" references, be included.
(4) Degree of vocabulary control. Operational definition: number of terms
in the entry vocabulary/number of terms in the indexing vocabulary. If
indexing is uncontrolled this value becomes 1.
(5) Term discrimination value. Operational definition (Salton, Wong and
Yu4): Document surrogates are vectors (binary or weighted) of index
terms. The similarity between two documents
\[ d_1 = (d_{11}, d_{12}, \ldots, d_{1n}) \]
\[ d_2 = (d_{21}, d_{22}, \ldots, d_{2n}) \]
is measured by the cosine coefficient:
\[ S(d_1, d_2) = \frac{\sum_{i=1}^{n} d_{1i} d_{2i}}{\left( \sum_{i=1}^{n} d_{1i}^2 \, \sum_{i=1}^{n} d_{2i}^2 \right)^{1/2}} \]
The centroid of a set of documents is
\[ C = (c_1, c_2, \ldots, c_n) \]
where
\[ c_i = \frac{\sum_{k=1}^{m} d_{ki}}{m} \]
The summation is over all m documents in the set. The term
discrimination value of the jth term is then
\[ D_j = Q_j - Q \]
where Q, the compactness of the collection, is defined as the average
similarity of the documents with the centroid
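The computation of the term discrimination value can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of Salton, Wong and Yu as published: the function names and the toy collection are hypothetical, the quantity Q_j is assumed to be the compactness recomputed after the jth term is deleted from every document vector, and a similarity of zero is assumed by convention when either vector has zero length.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    # Assumed convention: similarity is 0 if either vector is all zeros.
    return dot / norm if norm else 0.0


def centroid(docs):
    """Component-wise mean of the m document vectors."""
    m = len(docs)
    return [sum(column) / m for column in zip(*docs)]


def compactness(docs):
    """Q: average similarity of the documents with their centroid."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs) / len(docs)


def discrimination_value(docs, j):
    """D_j = Q_j - Q, with Q_j assumed to be Q after deleting term j."""
    reduced = [d[:j] + d[j + 1:] for d in docs]
    return compactness(reduced) - compactness(docs)


# Hypothetical toy collection: three documents, three binary index terms.
docs = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]

for j in range(3):
    print(j, round(discrimination_value(docs, j), 4))
```

In this toy collection term 0 occurs in every document, so deleting it makes the documents less alike and its value is negative; terms 1 and 2 each separate one document from the others, so their values are positive.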
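The operational definitions (1) to (4) above reduce to simple counts, which the following Python sketch illustrates. The data structures are assumptions made for the example only: the index is a hypothetical mapping from document identifier to a set of index terms, the thesaurus is a hypothetical mapping from a term to its "see also" targets, and exhaustivity is averaged over the collection, which the definition above does not itself prescribe.

```python
from collections import Counter


def mean_exhaustivity(index):
    """Average number of distinct index terms per document."""
    return sum(len(set(terms)) for terms in index.values()) / len(index)


def postings_per_term(index):
    """Number of documents to which each term is assigned."""
    return Counter(t for terms in index.values() for t in set(terms))


def degree_of_linkage(see_also):
    """Total number of 'see also' references in the vocabulary."""
    return sum(len(targets) for targets in see_also.values())


def vocabulary_control(entry_vocabulary, indexing_vocabulary):
    """Entry-vocabulary size divided by indexing-vocabulary size."""
    return len(set(entry_vocabulary)) / len(set(indexing_vocabulary))


# Hypothetical toy data.
index = {
    "d1": {"retrieval", "indexing", "thesaurus"},
    "d2": {"retrieval", "evaluation"},
    "d3": {"retrieval"},
}
see_also = {
    "retrieval": {"searching", "indexing"},
    "thesaurus": {"vocabulary"},
}
entry_vocab = {"retrieval", "searching", "lookup", "indexing"}
indexing_vocab = {"retrieval", "indexing"}

print(mean_exhaustivity(index))
print(postings_per_term(index)["retrieval"])
print(degree_of_linkage(see_also))
print(vocabulary_control(entry_vocab, indexing_vocab))
```

A term with many postings, such as "retrieval" in this toy index, would be scored as less specific, subject to the caveat in (2) that posting counts also reflect fashions in the literature. When the entry and indexing vocabularies coincide, as with uncontrolled indexing, the last ratio is 1.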