IRE
Information Retrieval Experiment
The Smart environment for retrieval system evaluation - advantages and problem areas
chapter
Gerard Salton
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
15.2 Basic Smart system assumptions and early results
In the Smart system each record, or document, is represented by a vector of
terms, that is $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$, where $d_{ij}$ represents the weight or
importance of term $j$ for document $D_i$. By `term' is meant some form of
content identifier such as a word extracted from a document text, a word
phrase, a thesaurus class, an entry from a term hierarchy, etc. A query $Q_j$ can
be similarly represented as $Q_j = (q_{j1}, q_{j2}, \ldots, q_{jt})$, and retrieval of a stored
item can be made to depend on the magnitude of a global similarity
coefficient $s(D_i, Q_j)$. Specifically, whenever $s(D_i, Q_j) \geq T$ for some threshold $T$,
$D_i$ is retrieved in answer to $Q_j$. It should be noted that an exact match
between any particular query and document terms is never required for
retrieval of an item. Instead, the similarity measure $s$ may be based on the
composite similarities between the full query and document vectors.
Furthermore, since $s(D_i, Q_j)$ represents a measure of closeness between $D_i$
and $Q_j$, the output documents can be presented to the user population in
ranked order of presumed relevance to the user, that is, in decreasing order
of the corresponding $s$ coefficients.
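The retrieval rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the passage does not say which similarity coefficient is used, so the cosine measure is assumed here, and the function names, document identifiers, weights and threshold are all hypothetical.

```python
import math

def similarity(doc, query):
    """Assumed coefficient: cosine of the angle between two term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
    return dot / norm if norm else 0.0

def retrieve(docs, query, threshold):
    """Return (doc_id, score) pairs with score >= threshold, highest score first."""
    scored = [(doc_id, similarity(vec, query)) for doc_id, vec in docs.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical collection indexed by t = 4 terms.
docs = {
    "D1": [1.0, 0.0, 2.0, 0.0],
    "D2": [0.0, 3.0, 0.0, 1.0],
    "D3": [1.0, 1.0, 1.0, 0.0],
}
query = [1.0, 0.0, 1.0, 0.0]

ranked = retrieve(docs, query, threshold=0.5)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

In this toy collection D3 is retrieved even though it carries a term absent from the query, which reflects the point that no exact match between query and document terms is required.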
The following assumptions are immediately implied by the vector
processing environment:
(a) In principle, each term included in a given vector is as important as any
other term (except for the possible distinction implied by a particular
term weight assignment); that is, each term represents a particular
dimension in the t-dimensional vector space defined by the t terms used
to index the document collection.
(b) No relationships are defined between distinct terms; that is, the
co-ordinate axes representing the distinct terms are assumed to be
orthogonal.
(c) A document is represented by a particular position, and possibly by a
given length, in the t-dimensional vector space. (In practice, it is often
convenient to normalize all vectors to some given standard length.)
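Assumptions (b) and (c) can be illustrated with a short Python sketch. It assumes the Euclidean length and a standard length of 1, neither of which is specified in the text; the helper names and sample vectors are hypothetical.

```python
import math

def normalize(vec, length=1.0):
    """Scale a term-weight vector to the given Euclidean length (assumed norm)."""
    norm = math.sqrt(sum(w * w for w in vec))
    if norm == 0:
        return list(vec)  # an all-zero vector has no direction to preserve
    return [w * length / norm for w in vec]

def inner_product(a, b):
    """Inner product over orthogonal term axes: no cross-term contributions."""
    return sum(x * y for x, y in zip(a, b))

# Two hypothetical documents with the same term proportions but different lengths.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

u = normalize(short_doc)
v = normalize(long_doc)

# Each term is its own axis; distinct axes have zero inner product.
term_1_axis = [1.0, 0.0, 0.0]
term_2_axis = [0.0, 1.0, 0.0]
print(inner_product(term_1_axis, term_2_axis))
```

After normalization the two documents occupy the same position in the space, so only the direction of a vector, not its length, then affects a similarity based on the inner product.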
In examining the Smart system, it is necessary to consider also another
principal characteristic of the experimental environment, namely the use of
small sample collections of documents and user queries for test purposes.
Such a test environment makes it possible to carry out many different
experiments at reasonable cost. Furthermore, a great many inconveniences
inherent in the use of large operational collections are immediately
eliminated. Thus the user population can supply full relevance assessments
of each document with respect to each query, leading to the
generation of accurate recall-precision measures. The alternative would
consist in using sampling techniques and obtaining relevance assessments
for a portion of the document collection only. The use of sampling methods,
however, introduces additional variables and the evaluation results may then
be subject to substantial fluctuations.
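With full relevance assessments available, recall and precision can be computed exactly at every rank of the output. The following Python sketch uses the standard definitions (recall as the fraction of relevant documents retrieved, precision as the fraction of retrieved documents that are relevant); the ranking and the relevance judgments are invented for illustration.

```python
def recall_precision(ranking, relevant):
    """Return a (recall, precision) pair after each rank position."""
    relevant = set(relevant)
    points = []
    found = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
        recall = found / len(relevant) if relevant else 0.0
        precision = found / rank
        points.append((recall, precision))
    return points

# Hypothetical ranked output for one query, with complete relevance judgments.
ranking = ["D1", "D4", "D3", "D2", "D5"]
relevant = {"D1", "D3"}

points = recall_precision(ranking, relevant)
for rank, (r, p) in enumerate(points, start=1):
    print(rank, round(r, 2), round(p, 2))
```

Because every document has been judged, the denominator of recall is known exactly; under sampling it would have to be estimated, which is one source of the fluctuations mentioned above.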
The small document environment used in the Smart experiments also
renders unnecessary the choice of various parameter values which would
otherwise be required to control the retrieval process. Because the documents
are ranked at the output in decreasing order of query-document similarity,