Information Retrieval Experiment

Chapter: The Smart environment for retrieval system evaluation - advantages and problem areas
Gerard Salton
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

15.2 Basic Smart system assumptions and early results

In the Smart system each record, or document, is represented by a vector of terms, that is $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$, where $d_{ij}$ represents the weight or importance of term $j$ for document $D_i$. By `term' is meant some form of content identifier such as a word extracted from a document text, a word phrase, a thesaurus class, an entry from a term hierarchy, etc. A query $Q_j$ can be similarly represented as $Q_j = (q_{j1}, q_{j2}, \ldots, q_{jt})$, and retrieval of a stored item can be made to depend on the magnitude of a global similarity coefficient $s(D_i, Q_j)$. Specifically, whenever $s(D_i, Q_j) \geq T$ for some threshold $T$, $D_i$ is retrieved in answer to $Q_j$. It should be noted that an exact match between particular query and document terms is never required for retrieval of an item. Instead, the similarity measure $s$ may be based on the composite similarities between the full query and document vectors. Furthermore, since $s(D_i, Q_j)$ represents a measure of closeness between $D_i$ and $Q_j$, the output documents can be presented to the user population in ranked order of presumed relevance to the user, that is, in decreasing order of the corresponding $s$ coefficients.
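The threshold-and-rank procedure described above can be illustrated with a minimal sketch. The text leaves the similarity coefficient s unspecified, so the cosine of the angle between the two vectors is used here purely as an assumption; the function names `cosine` and `retrieve`, the three sample documents, and the threshold value are likewise hypothetical.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors.

    Assumption: the source does not fix a similarity coefficient;
    cosine is one common choice for vector matching.
    """
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0


def retrieve(docs, query, threshold):
    """Return (doc_id, score) pairs with score >= threshold, highest score first."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in docs.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection indexed by t = 4 terms (weights are made up).
docs = {
    "D1": [1.0, 0.0, 2.0, 0.0],
    "D2": [0.0, 3.0, 0.0, 1.0],
    "D3": [1.0, 1.0, 1.0, 0.0],
}
query = [1.0, 0.0, 1.0, 0.0]

ranked = retrieve(docs, query, threshold=0.5)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

In this toy collection D3 is retrieved even though it carries a term absent from the query, which reflects the point that no exact term match is required; D2 shares no terms with the query and falls below the threshold.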
The following assumptions are immediately implied by the vector processing environment:
(a) In principle, each term included in a given vector is as important as any other term (except for the possible distinction implied by a particular term weight assignment); that is, each term represents a particular dimension in the t-dimensional vector space defined by the t terms used to index the document collection.
(b) No relationships are defined between distinct terms; that is, the coordinate axes representing the distinct terms are assumed to be orthogonal.
(c) A document is represented by a particular position, and possibly by a given length, in the t-dimensional vector space. (In practice, it is often convenient to normalize all vectors to some given standard length.)
In examining the Smart system, it is necessary to consider also another principal characteristic of the experimental environment, namely the use of small sample collections of documents and user queries for test purposes. Such a test environment makes it possible to carry out many different experiments at reasonable cost. Furthermore, a great many inconveniences inherent in the use of large operational collections are immediately eliminated. Thus full relevance assessments of each document with respect to each query can be obtained from the user population, leading to the generation of accurate recall-precision measures. The alternative would be to use sampling techniques and to obtain relevance assessments for only a portion of the document collection. The use of sampling methods, however, introduces additional variables, and the evaluation results may then be subject to substantial fluctuations. The small document environment used in the Smart experiments also renders unnecessary the choice of various parameter values which would otherwise be required to control the retrieval process.
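When full relevance assessments are available for a query, recall and precision can be computed directly at every position of the ranked output. The following is a minimal sketch under that assumption, using the standard definitions (recall is the fraction of relevant documents retrieved so far, precision is the fraction of retrieved documents that are relevant); the function name `recall_precision` and the sample ranking and judgments are hypothetical and are not taken from the Smart experiments.

```python
def recall_precision(ranking, relevant):
    """Return a list of (recall, precision) pairs, one per rank position.

    ranking  -- document ids in decreasing order of similarity to the query
    relevant -- set of all document ids judged relevant to the query
    """
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranked output and complete relevance judgments for one query.
ranking = ["D1", "D3", "D2", "D4"]
relevant = {"D1", "D2"}

points = recall_precision(ranking, relevant)
for rank, (recall, precision) in enumerate(points, start=1):
    print(rank, round(recall, 2), round(precision, 2))
```

Because every document in the small collection has been judged, the denominator of recall is known exactly; with sampled judgments it would have to be estimated.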
Because the documents are ranked at the output in decreasing order of query-document similarity,