Information Retrieval Experiment

Chapter: The pragmatics of information retrieval experimentation
Jean M. Tague
Butterworth & Company
Karen Sparck Jones

and a measure of dispersion which varies between 0 and 1 is D 2(m-1) n-i

Brookes has modified this measure to take into account the relative sizes in the total population.

Indexing languages and indexing procedures

Many investigators have contributed to the operationalization of indexing variables. Below are listed some of the more generally accepted definitions.

(1) Exhaustivity of indexing, i.e. the number of topics covered by the indexing. Operational definition: number of index terms per document. Keen and Wheatley [3] suggest that redundant indexing, for example synonyms and morphological variants, be eliminated.

(2) Specificity of indexing, i.e. the preciseness of the subject description. Operational definition: number of postings per term. This value, however, may depend as much on fashions in the literature as on the specificity of a term.

(3) Degree of linkage in a vocabulary. Operational definition: number of references in the dictionary or thesaurus. Keen suggests that only "see also" references, not "see" references, be included.

(4) Degree of vocabulary control. Operational definition: number of terms in the entry vocabulary / number of terms in the indexing vocabulary. If indexing is uncontrolled this value becomes 1.

(5) Term discrimination value. Operational definition (Salton, Wong and Yu [4]): Document surrogates are vectors (binary or weighted) of index terms.
The similarity between two documents

$$d_1 = (d_{11}, d_{12}, \ldots, d_{1n}), \qquad d_2 = (d_{21}, d_{22}, \ldots, d_{2n})$$

is measured by the cosine coefficient:

$$S(d_1, d_2) = \frac{\sum_{i=1}^{n} d_{1i}\, d_{2i}}{\left( \sum_{i=1}^{n} d_{1i}^{2} \; \sum_{i=1}^{n} d_{2i}^{2} \right)^{1/2}}$$

The centroid of a set of documents is $C = (c_1, c_2, \ldots, c_n)$ where

$$c_i = \frac{\sum_{k=1}^{m} d_{ki}}{m}$$

The summation is over all $m$ documents in the set. The term discrimination value of the $j$th term is then

$$D_j = Q_j - Q$$

where $Q$, the compactness of the collection, is defined as the average similarity of the documents with the centroid.
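The definitions above can be tried out on a small collection. The following Python sketch is an illustration only, not code from the chapter: it assumes documents are equal-length lists of term weights, and it assumes that Q_j denotes the compactness of the collection recomputed with term j removed, following the usual reading of Salton, Wong and Yu; this passage does not itself define Q_j. All function names are hypothetical. Simple counts for exhaustivity (index terms per document) and specificity (postings per term) are included for comparison.

```python
import math


def cosine(d1, d2):
    """Cosine coefficient S(d1, d2) between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    # Assumption: an all-zero vector is given similarity 0 rather than raising.
    return dot / norm if norm else 0.0


def centroid(docs):
    """Centroid C: the mean weight of each term over all m documents."""
    m = len(docs)
    return [sum(column) / m for column in zip(*docs)]


def compactness(docs):
    """Q: average similarity of the documents with the centroid."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs) / len(docs)


def discrimination_value(docs, j):
    """D_j = Q_j - Q, with Q_j assumed to be the compactness
    of the collection after term j is dropped from every document."""
    reduced = [list(d[:j]) + list(d[j + 1:]) for d in docs]
    return compactness(reduced) - compactness(docs)


def exhaustivity(docs):
    """Mean number of index terms (non-zero weights) per document."""
    return sum(sum(1 for w in d if w) for d in docs) / len(docs)


def postings(docs, j):
    """Number of documents to which term j is assigned."""
    return sum(1 for d in docs if d[j])


if __name__ == "__main__":
    # Toy binary collection: 2 documents, 3 terms (made-up data).
    docs = [[1, 1, 0], [1, 0, 1]]
    print("Q =", compactness(docs))
    for j in range(3):
        print("term", j, "postings", postings(docs, j),
              "D_j", discrimination_value(docs, j))
```

In this toy collection, term 0 occurs in every document; removing it lowers the compactness, so its D_j is negative under the sign convention used here, while a term confined to one document gets a positive value.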