IRE
Information Retrieval Experiment
The Smart environment for retrieval system evaluation-advantages and problem areas
chapter
Gerard Salton
Butterworth & Company
Karen Sparck Jones
Theoretical insights
[Figure 15.1. Basic document space alteration. (a) Before assignment of term k; (b) after assignment of term k. Figure label: Assign term]
space when used for indexing purposes must be rendered more specific: their
frequency of assignment can be decreased by incorporating the terms into
term phrases and assigning the phrases as content identifiers (for example,
instead of using 'computer' as an index term, one could form the phrases
'computer programmer', 'computer hardware', or 'computer security').
Low-frequency terms, on the other hand, can be broadened by incorporating
them into thesaurus classes consisting of groups of related or synonymous
terms. Each thesaurus class necessarily exhibits a higher assignment
frequency in a collection than the individual terms included in the
class [36].
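The effect of phrases and thesaurus classes on assignment frequency can be illustrated with a minimal sketch. The toy collection, the adjacent-word phrase rule, and the `thesaurus` mapping below are all assumptions made for illustration; they are not the Smart procedures themselves.

```python
from collections import Counter

# Toy collection, invented for illustration: each document is a list of words.
docs = [
    "computer programmer writes code".split(),
    "computer hardware fails".split(),
    "computer security audit".split(),
    "computer programmer salary survey".split(),
    "car dealer opens".split(),
    "automobile engine repair".split(),
]

def doc_frequency(units_per_doc):
    """Count the number of documents in which each indexing unit occurs."""
    df = Counter()
    for units in units_per_doc:
        df.update(set(units))  # count each unit at most once per document
    return df

def adjacent_phrases(words):
    """Form two-word phrases from adjacent words (a crude stand-in for phrase construction)."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

# Hypothetical thesaurus: maps low-frequency terms to a shared class identifier.
thesaurus = {"car": "CLASS_VEHICLE", "automobile": "CLASS_VEHICLE"}

def apply_thesaurus(words):
    """Replace each word by its thesaurus class when one is defined."""
    return [thesaurus.get(w, w) for w in words]

term_df = doc_frequency(docs)
phrase_df = doc_frequency([adjacent_phrases(d) for d in docs])
class_df = doc_frequency([apply_thesaurus(d) for d in docs])

# The broad term is assigned to more documents than any phrase built from it.
print(term_df["computer"], phrase_df["computer programmer"])
# The thesaurus class is assigned to more documents than either member term.
print(term_df["car"], term_df["automobile"], class_df["CLASS_VEHICLE"])
```

In this toy collection the phrases are narrower than 'computer' alone, and the class is broader than 'car' or 'automobile' alone, which is the frequency adjustment described in the text.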
The vector space model of information representation and retrieval is thus
capable of assigning a specific interpretation to well-known intellectual
content analysis aids such as term grouping and thesauri, and this role is
different from the standard semantic functions of such devices in linguistics.
When relevance information of documents with respect to search requests is
available in retrieval (as is the case in many systems that provide user-system
interaction), a term relevance factor known as term precision can be
computed as the ratio of the number of relevant items containing a given term
to the total number of items (or to the number of non-relevant items)
containing the term. Terms with a high precision factor are capable of
distinguishing the relevant items from the non-relevant ones; the neutral term
discrimination weights can then be replaced by term precision weights. It has
been shown that a term weighting system based on term precision is
theoretically optimal for the binary-independent retrieval model (where
binary weighted terms are independently assigned to queries and
documents) [37, 38]. Furthermore, a good deal of experimental evidence
demonstrates the usefulness of the term relevance factors even in cases
where the binary-independent model does not strictly apply [39, 40].
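A small sketch may make the two variants of the term precision ratio concrete. The function name `term_precision`, its `against` parameter, and the sample data are hypothetical; the sketch computes only the raw ratios described above and does not reproduce the weighting formulas of the cited work.

```python
def term_precision(term, docs, relevant_ids, against="total"):
    """Ratio of relevant items containing `term` to items containing `term`.

    docs: dict mapping a document id to the set of terms assigned to it.
    relevant_ids: set of ids judged relevant to the search request.
    against: "total" divides by all items containing the term;
             "nonrelevant" divides by the non-relevant items containing it.
    Returns None when the denominator is zero (the ratio is undefined).
    """
    containing = {doc_id for doc_id, terms in docs.items() if term in terms}
    relevant_with_term = len(containing & relevant_ids)
    if against == "total":
        denominator = len(containing)
    elif against == "nonrelevant":
        denominator = len(containing - relevant_ids)
    else:
        raise ValueError("against must be 'total' or 'nonrelevant'")
    if denominator == 0:
        return None
    return relevant_with_term / denominator

# Invented sample data: five indexed items, two of them judged relevant.
sample_docs = {
    1: {"retrieval", "vector"},
    2: {"retrieval", "cluster"},
    3: {"vector", "weight"},
    4: {"retrieval"},
    5: {"weight"},
}
sample_relevant = {1, 2}

for t in ("retrieval", "vector", "weight"):
    print(t, term_precision(t, sample_docs, sample_relevant))
```

Here 'retrieval' occurs in both relevant items and in one non-relevant item, so it receives a higher precision than 'weight', which occurs only in non-relevant items.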
For the binary-independent model, which is relatively easy to treat
mathematically, various Smart procedures can also be shown to be formally
effective under specified circumstances. Thus an effective cluster search
method is available that concentrates the search effort in the
most productive areas of a classified collection [41]. Formally effective
document and query vector alteration methods have also been studied,
including in particular the Smart relevance feedback process [42, 43].
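Query vector alteration by relevance feedback can be sketched as follows. This is a Rocchio-style update, a formulation commonly associated with relevance feedback in Smart; the text above does not give a formula, so the function name `rocchio_update`, the weights `alpha`, `beta`, and `gamma`, and the sample vectors are assumptions chosen for illustration.

```python
def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move a query vector toward relevant items and away from non-relevant ones.

    query: dict mapping term -> weight.
    relevant, nonrelevant: lists of document vectors (dicts of term -> weight).
    alpha, beta, gamma: illustrative weights, not values taken from the text.
    Terms whose updated weight is not positive are dropped from the result.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    updated = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            # Add the average weight of the term over the relevant items.
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            # Subtract the average weight of the term over the non-relevant items.
            weight -= gamma * sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant)
        if weight > 0:
            updated[term] = weight
    return updated

# Invented example: one relevant and one non-relevant item from a previous search.
original_query = {"retrieval": 1.0}
relevant_docs = [{"retrieval": 1.0, "feedback": 1.0}]
nonrelevant_docs = [{"retrieval": 1.0, "hardware": 1.0}]

new_query = rocchio_update(original_query, relevant_docs, nonrelevant_docs)
print(sorted(new_query.items()))
```

The altered query gains the term 'feedback' from the relevant item, while 'hardware', which occurs only in the non-relevant item, receives a negative weight and is dropped.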