IRE
Information Retrieval Experiment
The Smart environment for retrieval system evaluation-advantages and problem areas
chapter
Gerard Salton
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
the assignment of content identifiers to bibliographic items designed to lead
to their retrieval when wanted or their rejection when not wanted. Normally,
the indexer considers each item in isolation and assigns content terms that
are related in some sense to the document content. This procedure may not
lead to effective retrieval, because the choice of appropriate index terms
depends not only on the contents of each individual document, but also on
the contents of all other documents in the collection. For example, the term
`computer' may be appropriate in identifying a document entitled `Uses of
Computers in Medicine' if such an item is placed in a collection of medical
items, most of which will necessarily be unrelated to computers. `Computer'
would be a poor choice for that same document if the item were to be placed
in a computer science collection, because then all other documents are also
computer-oriented.
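
The dependence of a term's usefulness on the rest of the collection can be sketched in a few lines. The two toy collections and the helper name doc_fraction below are hypothetical and serve only as an illustration; they are not part of the Smart system.

```python
def doc_fraction(term, collection):
    """Fraction of the documents in a collection that carry the given term."""
    return sum(term in doc for doc in collection) / len(collection)

# Hypothetical toy collections; each document is a set of index terms.
medical = [
    {"computer", "medicine", "diagnosis"},  # `Uses of Computers in Medicine'
    {"surgery", "anesthesia"},
    {"cardiology", "diagnosis"},
    {"pharmacology", "dosage"},
]
computer_science = [
    {"computer", "medicine", "diagnosis"},  # the same document
    {"computer", "compiler"},
    {"computer", "network"},
    {"computer", "database"},
]

print(doc_fraction("computer", medical))           # 0.25
print(doc_fraction("computer", computer_science))  # 1.0
```

In the medical collection the term singles out one item in four; in the computer science collection it is carried by every item and therefore distinguishes nothing.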
Thus, indexing implies the assignment of content identifiers to documents
that are capable of reflecting the document content in some sense, and that
distinguish the items from each other. In the vector space environment,
distinguishing the items implies decreasing their similarities, or increasing
their mutual distance in the space.
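
As a minimal sketch of what decreasing similarity means, assume that similarity between document vectors is measured by the cosine coefficient (one common choice in the vector space model; the section itself does not fix a particular measure). The vectors and weights below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Two documents indexed only by a term they share: indistinguishable.
shared_only_1 = {"computer": 1.0}
shared_only_2 = {"computer": 1.0}

# The same documents with one distinguishing term each.
indexed_1 = {"computer": 1.0, "medicine": 1.0}
indexed_2 = {"computer": 1.0, "compiler": 1.0}

print(round(cosine(shared_only_1, shared_only_2), 3))  # 1.0
print(round(cosine(indexed_1, indexed_2), 3))          # 0.5
```

Adding the distinguishing terms lowers the similarity of the pair, which corresponds to moving the two items apart in the space.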
The requirement to create a document space that is spread-out, that is,
where the distances between document vectors are as large as possible, leads
to the assignment of term importance values, or term weights, to the content
identifiers used for indexing purposes. One such indication of term
importance is the term discrimination value which measures the ability of a
term to spread out the document space when assigned to the documents of a
collection31. In the absence of information about the actual term
relevance, one can relate the term discrimination value to various occurrence
frequency characteristics of the terms in a collection34,35. It turns out that
the best terms are medium-frequency terms, assigned neither to too many
documents in a collection nor to too few. High-frequency terms, assigned to
many items in a collection, render the document vectors more similar to each
other, thereby compressing the space and making it difficult to retrieve the
individual items when wanted; low-frequency terms, on the other hand, are
assigned to so few documents that their overall effect is not sufficiently
felt. When medium-frequency terms are used, those items
to which they are assigned are rendered more similar to each other, but at the
same time the differences between such items and the remainder of the
collection will be increased. This is symbolically illustrated in the document
space representation of Figure 15.1, where each x denotes a document, and
the distance between two x's is assumed to be inversely related to the
similarity in the respective document vectors.
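
One way to make the discrimination value concrete is sketched below. It assumes binary term assignments, cosine similarity, and a space density defined as the average pairwise similarity; the discrimination value of a term is then taken as the density with the term removed minus the density with the term present. These choices, the function names, and the six-document collection are assumptions made for illustration, not a statement of how the Smart system computes the value.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two documents given as sets of binary terms."""
    if not u or not v:
        return 0.0
    return len(u & v) / math.sqrt(len(u) * len(v))

def space_density(docs):
    """Average pairwise similarity; a higher value means a more compressed space."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrimination_value(term, docs):
    """Density without the term minus density with it.
    Positive: the term spreads the space out. Negative: it compresses it."""
    without_term = [doc - {term} for doc in docs]
    return space_density(without_term) - space_density(docs)

# Hypothetical collection: "common" is assigned to every document,
# "a" and "b" to half of them each, and each "u*" to a single document.
docs = [
    {"common", "a", "u1"}, {"common", "a", "u2"}, {"common", "a", "u3"},
    {"common", "b", "u4"}, {"common", "b", "u5"}, {"common", "b", "u6"},
]

print(discrimination_value("common", docs) < 0)  # True: compresses the space
print(discrimination_value("a", docs) > 0)       # True: spreads the space out
```

The collection is too small to illustrate the low-frequency case: with only six documents, a term unique to one item still shifts the average noticeably, whereas in a realistic collection its effect would be spread over a very large number of pairs.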
The space alteration of Figure 15.1 is obviously desirable under the
assumption that the items to which term k is assigned will prove jointly
relevant to the users' information requests: these items are made similar to
each other, rendering them easily retrievable together and thus producing
high recall; at the same time, these items are distinguished from the
remainder of the collection, which leads to high precision and to the correct
rejection of the extraneous items.
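
The effect on recall and precision can be illustrated with the standard definitions: recall is the proportion of relevant items that are retrieved, and precision is the proportion of retrieved items that are relevant. The document identifiers and the two retrieval outcomes below are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for sets of retrieved and relevant items."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Assume the three items carrying term k are exactly the relevant ones.
relevant = {"d1", "d2", "d3"}

# Spread-out space: the cluster is retrieved together and nothing else.
print(recall_precision({"d1", "d2", "d3"}, relevant))  # (1.0, 1.0)

# Compressed space: extraneous items come along with the relevant cluster.
print(recall_precision({"d1", "d2", "d3", "d4", "d5", "d6"}, relevant))  # (1.0, 0.5)
```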
The term discrimination model is used to generate an automatic indexing
system in which the discriminating medium-frequency terms serve directly
for indexing purposes. The high-frequency terms that compress the document