IRE
Information Retrieval Experiment
The Smart environment for retrieval system evaluation-advantages and problem areas
chapter
Gerard Salton
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
word stems from document titles and abstracts, possibly supplemented by
the use of a term classification, or thesaurus, designed to recognize some
synonyms and related terms10-12.
At first, the evaluation results were thought to be indicative of flaws in the
system design, and the decision was made to redesign the Smart environment
so as to create a more flexible retrieval environment. In time, several other
large-scale retrieval tests carried out independently of the Smart environment
have, however, confirmed the original Smart results. In particular, the well-
known Aslib Cranfield project also found that the simpler indexing
methodologies were more effective than the more complex ones, and at the
present time, there is an understanding among retrieval experts that an
overspecification of document content normally produced by the more
refined indexing methodologies can be just as detrimental as an
underspecification13. This evidence does not, however, prevent many people from still
clamouring for more sophisticated linguistic analysis procedures to be
incorporated into automatic indexing systems, or indeed from incorporating
such methodologies into newly designed retrieval systems14.
The extended Smart system is briefly described in the next section and the
various insights gained from the Smart experimentation are discussed in the
remainder of the study.
15.3 The extended Smart system
Since the initial Smart experiments were in a sense 'unsuccessful', it seemed
reasonable to generalize the basic experimental framework in an attempt to
determine just what went wrong with the early tests, and to identify indexing,
search and retrieval methods that would actually prove effective. Accordingly,
an extended system was developed with the following capabilities:
(a) A large number of automatic indexing procedures were made available,
including operations with automatically generated term associations
and term hierarchies. Furthermore, the indexing products could be
derived by analysing document titles only, titles and abstracts, or full
document texts, and the query-document comparisons could be carried
out using a variety of similarity measures15.
(b) So-called relevance feedback capabilities were implemented making it
possible automatically to generate improved query formulations based
on relevance assessments submitted by the users in response to previously
retrieved documents. A given user-system interaction could then be
carried out in several steps using continually improved query formulations
until satisfactory output was obtained16.
(c) Various file organizations could be used, including classified, or clustered,
collections in which a partial traversal of the stored records would quickly
lead to the retrieval of items in areas of interest to the user population.
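The three capabilities listed above can be illustrated with a minimal sketch. It is not the Smart implementation: the cosine measure is only one of the similarity measures that could be used, the query reformulation follows a Rocchio-style formula whose weights (alpha, beta, gamma) are assumed for illustration, and the clustered search compares the query with cluster centroids before ranking documents inside the best-matching clusters. All function names and the dictionary representation of term vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style reformulation (assumed formula and weights): move the
    query toward documents judged relevant and away from the others."""
    new = Counter()
    for t, w in query.items():
        new[t] += alpha * w
    for docs, signed_coef in ((relevant, beta), (nonrelevant, -gamma)):
        for d in docs:
            for t, w in d.items():
                new[t] += signed_coef * w / len(docs)
    # Terms whose weight drops to zero or below are removed from the query.
    return {t: w for t, w in new.items() if w > 0}

def centroid(docs):
    """Average term-weight vector of a cluster of documents."""
    total = Counter()
    for d in docs:
        for t, w in d.items():
            total[t] += w
    return {t: w / len(docs) for t, w in total.items()}

def cluster_search(query, clusters, top_clusters=1):
    """Partial traversal: rank only the documents found in the clusters
    whose centroids are most similar to the query."""
    ranked = sorted(clusters, key=lambda c: cosine(query, centroid(c)),
                    reverse=True)
    candidates = [d for c in ranked[:top_clusters] for d in c]
    return sorted(candidates, key=lambda d: cosine(query, d), reverse=True)

# Toy collection: term weights are invented for the example.
d1 = {"retrieval": 1.0, "index": 1.0}
d2 = {"retrieval": 1.0, "feedback": 1.0}
d3 = {"cooking": 1.0}
q = {"retrieval": 1.0}

q2 = feedback(q, relevant=[d2], nonrelevant=[d1])
hits = cluster_search(q, [[d1, d2], [d3]])
```

In this toy run the reformulated query q2 gains the term "feedback" from the relevant document and scores d2 above d1, while the clustered search never examines d3 because its cluster centroid shares no terms with the query.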
Extensions were also considered by applying the automatic procedures to
foreign language documents, and by utilizing bibliographic citations as
content identifiers18,19. Eventually, the Smart procedures were compared
with the conventional inverted file technologies based on keywords manually
assigned to the documents of a collection20,21.
A full discussion of the retrieval results is beyond the scope of this study.