IRE Information Retrieval Experiment The Smart environment for retrieval system evaluation-advantages and problem areas chapter Gerard Salton Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

the assignment of content identifiers to bibliographic items designed to lead to their retrieval when wanted or their rejection when not wanted. Normally, the indexer considers each item in isolation and assigns content terms that are related in some sense to the document content. This procedure may not lead to effective retrieval, because the choice of appropriate index terms depends not only on the contents of each individual document, but also on the contents of all other documents in the collection. For example, the term 'computer' may be appropriate in identifying a document entitled 'Uses of Computers in Medicine' if such an item is placed in a collection of medical items, most of which will necessarily be unrelated to computers. 'Computer' would be a poor choice for that same document if the item were to be placed in a computer science collection, because then all other documents are also computer-oriented. Thus, indexing implies the assignment of content identifiers to documents that are capable of reflecting the document content in some sense, and that distinguish the items from each other. In the vector space environment, distinguishing the items implies decreasing their similarities, or increasing their mutual distance in the space.
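The 'computer' example can be made concrete with a small sketch. The code below is illustrative only and is not taken from the Smart system: it assumes binary term vectors represented as Python sets, cosine similarity as the similarity measure, and two invented four-document toy collections (`medical` and `computing`). The helper names `cosine`, `mean_pairwise_similarity` and `without_term` are hypothetical. It compares the average pairwise similarity of each collection with and without the term 'computer' in the indexing vocabulary.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two binary term vectors given as sets of terms."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def mean_pairwise_similarity(docs):
    """Average cosine similarity over all distinct document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def without_term(docs, term):
    """Copy of the collection with one term removed from every document."""
    return [d - {term} for d in docs]


# Invented toy collections (assumptions for illustration only).
# The first document in each stands for 'Uses of Computers in Medicine'.
medical = [
    {"computer", "medicine", "diagnosis"},
    {"medicine", "surgery", "patient"},
    {"medicine", "drug", "patient"},
    {"diagnosis", "patient", "drug"},
]
computing = [
    {"computer", "medicine", "diagnosis"},
    {"computer", "compiler", "language"},
    {"computer", "network", "protocol"},
    {"computer", "language", "protocol"},
]

for name, docs in (("medical", medical), ("computing", computing)):
    with_term = mean_pairwise_similarity(docs)
    without = mean_pairwise_similarity(without_term(docs, "computer"))
    print(f"{name}: with 'computer' {with_term:.3f}, without {without:.3f}")
```

On these toy data, assigning 'computer' lowers the average similarity in the medical collection, where it sets one item apart, and raises it in the computing collection, where every item carries the term. Real collections and weighted vectors would behave less cleanly; the sketch only shows the direction of the effect described above.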
The requirement to create a document space that is spread out, that is, one where the distances between document vectors are as large as possible, leads to the assignment of term importance values, or term weights, to the content identifiers used for indexing purposes. One such indication of term importance is the term discrimination value, which measures the ability of a term to spread out the document space when assigned to the documents of a collection31. In the absence of information about the actual term relevance, one can relate the term discrimination value to various occurrence frequency characteristics of the terms in a collection34,35. It turns out that the best terms are medium-frequency terms, assigned neither to too many documents in a collection nor to too few. High-frequency terms assigned to many items in a collection render the document vectors more similar to each other, thereby compressing the space and making it difficult to retrieve the individual items when wanted; low-frequency terms, on the other hand, are assigned to so few documents that their overall effect is not sufficiently felt. When medium-frequency terms are used, those items to which they are assigned are rendered more similar to each other, but at the same time the differences between such items and the remainder of the collection are increased. This is symbolically illustrated in the document space representation of Figure 15.1, where each x denotes a document, and the distance between two x's is assumed to be inversely related to the similarity of the respective document vectors. The space alteration of Figure 15.1
is obviously desirable under the assumption that the items to which term k is assigned will prove jointly relevant to the users' information requests: these items are made similar to each other, rendering them easily retrievable together and thus producing high recall; at the same time, these items are distinguished from the remainder of the collection, which leads to high precision and to the correct rejection of the extraneous items. The term discrimination model is used to generate an automatic indexing system in which the discriminating medium-frequency terms serve directly for indexing purposes. The high-frequency terms that compress the document