Information Retrieval Experiment (IRE)
Chapter: The Smart environment for retrieval system evaluation - advantages and problem areas
Gerard Salton
Edited by Karen Sparck Jones
Butterworth & Company

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Theoretical insights

[Figure 15.1. Basic document space alteration. (a) Before assignment of term k; (b) after assignment of term k.]

space when used for indexing purposes must be rendered more specific: their frequency of assignment can be decreased by incorporating the terms into term phrases, and assigning the phrases as content identifiers (for example, instead of using 'computer' as an index term, one could form the phrases 'computer programmer', or 'computer hardware', or 'computer security'). Low-frequency terms, on the other hand, can be broadened by incorporating the terms into thesaurus classes consisting of groups of related or synonymous terms. Each thesaurus class necessarily exhibits a higher assignment frequency in a collection than the individual terms included in a thesaurus class³⁶. The vector space model of information representation and retrieval is thus capable of assigning a specific interpretation to well-known intellectual content analysis aids such as term grouping and thesauri, and this role is different from the standard semantic functions of such devices in linguistics.
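The frequency argument above can be illustrated with a minimal Python sketch. It is not part of the Smart system: the toy collection, the function names (`term_df`, `phrase_df`, `class_df`) and the rule that a phrase means two adjacent index terms are all assumptions made for illustration. The sketch counts how many documents are assigned a single term, a phrase built from that term, and a thesaurus class of related terms.

```python
# Toy collection (hypothetical data): each document is a list of index terms.
DOCS = [
    ["computer", "programmer", "salary"],
    ["computer", "hardware", "design"],
    ["computer", "security", "audit"],
    ["computer", "hardware", "failure"],
    ["automobile", "engine"],
    ["car", "engine", "repair"],
    ["motorcar", "history"],
]


def term_df(term, docs):
    """Number of documents to which the single term is assigned."""
    return sum(1 for doc in docs if term in doc)


def phrase_df(phrase, docs):
    """Number of documents containing the two words of `phrase` as adjacent terms.

    Adjacency is an assumed, simplified definition of a term phrase.
    """
    first, second = phrase
    return sum(
        1
        for doc in docs
        if any(a == first and b == second for a, b in zip(doc, doc[1:]))
    )


def class_df(members, docs):
    """Number of documents containing at least one member of a thesaurus class."""
    member_set = set(members)
    return sum(1 for doc in docs if member_set & set(doc))


# A high-frequency term becomes more specific when replaced by phrases.
print("computer:", term_df("computer", DOCS))
for phrase in [("computer", "programmer"), ("computer", "hardware"), ("computer", "security")]:
    print(" ".join(phrase) + ":", phrase_df(phrase, DOCS))

# Low-frequency terms become broader when merged into a thesaurus class.
VEHICLE_CLASS = ["automobile", "car", "motorcar"]
for member in VEHICLE_CLASS:
    print(member + ":", term_df(member, DOCS))
print("class {automobile, car, motorcar}:", class_df(VEHICLE_CLASS, DOCS))
```

In this toy collection every phrase is assigned to fewer documents than 'computer' alone, and the thesaurus class is assigned to more documents than any one of its members, which is the direction of change described in the text.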
When relevance information for documents with respect to search requests is available in retrieval (as is the case in many systems that provide user-system interaction), a term relevance factor known as term precision can be computed as the ratio of the number of relevant items containing a given term to the total number of items (or to the number of non-relevant items) containing the term. Terms with a high precision factor are clearly capable of distinguishing the relevant items from the non-relevant ones; the neutral term discrimination weights can then be replaced by term precision weights. It has been shown that a term weighting system based on the use of term precision is theoretically optimal for the binary-independent retrieval model (where binary weighted terms are independently assigned to queries and documents)³⁷,³⁸. Furthermore, a good deal of experimental evidence is available demonstrating the usefulness of the term relevance factors even in cases where the binary-independent model does not strictly apply³⁹,⁴⁰. For the binary-independent model, which is relatively easy to treat mathematically, various Smart procedures can also be shown to be formally effective under specified circumstances. Thus an effective cluster search method is available which is capable of concentrating the search effort in the most productive areas of a classified collection⁴¹. Formally effective document and query vector alteration methods have also been studied, including in particular the Smart relevance feedback process⁴²,⁴³.
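The term precision factor defined at the start of this paragraph can be sketched in a few lines of Python. This is an illustrative reading of the verbal definition only, not the Smart implementation: the function name `term_precision`, the `against` parameter, the handling of zero denominators and the toy data are assumptions introduced here.

```python
import math

# Toy collection (hypothetical data): document id -> set of assigned terms.
DOCS = {
    1: {"retrieval", "vector"},
    2: {"retrieval", "cluster"},
    3: {"vector", "space"},
    4: {"retrieval", "feedback"},
    5: {"cluster", "library"},
}
# Documents judged relevant to some search request (hypothetical judgments).
RELEVANT = {1, 2, 3}


def term_precision(term, docs, relevant_ids, against="total"):
    """Ratio of relevant items containing `term` to items containing `term`.

    against="total":       denominator is all items containing the term.
    against="nonrelevant": denominator is the non-relevant items containing it.
    Zero denominators are handled by an assumed convention: 0.0 if no relevant
    item contains the term, otherwise infinity.
    """
    containing = {doc_id for doc_id, terms in docs.items() if term in terms}
    relevant_with_term = len(containing & relevant_ids)
    if against == "total":
        denominator = len(containing)
    elif against == "nonrelevant":
        denominator = len(containing) - relevant_with_term
    else:
        raise ValueError("against must be 'total' or 'nonrelevant'")
    if denominator == 0:
        return math.inf if relevant_with_term > 0 else 0.0
    return relevant_with_term / denominator


# Rank the vocabulary by term precision, highest first.
vocabulary = sorted(set().union(*DOCS.values()))
ranked = sorted(vocabulary, key=lambda t: term_precision(t, DOCS, RELEVANT), reverse=True)
for term in ranked:
    print(term, round(term_precision(term, DOCS, RELEVANT), 3))
```

Terms that occur only in relevant items (here 'vector' and 'space') receive the highest factor, while terms confined to non-relevant items receive zero, which is the sense in which high-precision terms separate relevant from non-relevant items.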