IRE
Information Retrieval Experiment
The Smart environment for retrieval system evaluation-advantages and problem areas
chapter
Gerard Salton
Butterworth & Company
Karen Sparck Jones
Retrieval system environment
While the operational retrieval environment has thus drastically changed
over the last few years, the intellectual design of the retrieval operations has
remained largely unchanged for some decades. The following principal
characteristics may be noted:
(a) documents are normally indexed manually, that is, subject indicators and
content descriptions are manually assigned to the bibliographic items by
subject experts and professional indexers;
(b) search statements are manually formulated by users or search
intermediaries using one or more acceptable search terms and appropriate
boolean connectives between the terms; subsequent reformulations and
improvements in the query formulations are also carried out manually;
(c) the principal file search device is an auxiliary, so-called inverted directory
which contains for each accepted content descriptor a list of the
document references to which that term is assigned; the documents to be
retrieved are then identified by comparing and merging the document
reference lists corresponding to the various query terms;
(d) an `exact match' retrieval strategy is carried out by retrieving all items
whose content description exactly matches the term combination
specified in the search request; normally, all retrieved items are
considered by the system as being equally relevant to the user's needs,
and no special method is provided for ranking the output items in
presumed order of goodness for the user.
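Characteristics (c) and (d) can be illustrated with a minimal sketch. The document numbers, descriptors, and function names below are hypothetical and chosen only for illustration; the sketch does not reproduce any particular operational system. It builds an inverted directory from manually assigned descriptors and answers boolean queries by merging the document reference lists, returning an unranked set of exact matches.

```python
from collections import defaultdict

def build_inverted_directory(assignments):
    """Map each content descriptor to a sorted list of document references."""
    directory = defaultdict(set)
    for doc_id, descriptors in assignments.items():
        for term in descriptors:
            directory[term].add(doc_id)
    return {term: sorted(refs) for term, refs in directory.items()}

def merge_and(a, b):
    """Intersect two sorted reference lists by stepping through both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_or(a, b):
    """Union of two reference lists, returned in sorted order."""
    return sorted(set(a) | set(b))

def merge_not(a, b):
    """References in a that do not appear in b."""
    excluded = set(b)
    return [ref for ref in a if ref not in excluded]

# Hypothetical manual indexing: document number -> assigned descriptors.
assignments = {
    1: {"retrieval", "indexing"},
    2: {"retrieval", "evaluation"},
    3: {"indexing", "linguistics"},
    4: {"retrieval", "indexing", "evaluation"},
}
directory = build_inverted_directory(assignments)

# retrieval AND indexing
both = merge_and(directory["retrieval"], directory["indexing"])
# (retrieval AND indexing) OR linguistics
either = merge_or(both, directory["linguistics"])
# retrieval AND NOT evaluation
without = merge_not(directory["retrieval"], directory["evaluation"])

print(both, either, without)  # [1, 4] [1, 3, 4] [1]
```

Every retrieved reference satisfies the boolean expression exactly, and nothing in the output distinguishes one retrieved item from another, which is the absence of ranking noted in (d).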
Enhancements are included in many of the modern search systems in the
form of `free text' manipulations allowing the user to choose arbitrary search
terms, that is, natural language terms that are not controlled by any dictionary
or authority lists, leading to the retrieval of all documents whose stored texts
(or text excerpts) contain a particular term combination included in the
search requests. But even in the free text search mode, inverted directories
are created containing all the text words that could lead to the retrieval of a
given document in the collection. Additional refinements in the search mode
are available in some modern online environments in the form of dictionary
and vocabulary displays leading to better query formulation capabilities.
However, the basic manual query formulation and exact match retrieval
strategy based on inverted files is maintained in practically all operational
retrieval situations.
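The free text mode described above can be sketched in the same way. This is an illustrative sketch only: the tokenizer, the sample texts, and the function names are assumptions, not a description of any specific search service. The inverted directory is built from every word of the stored text rather than from assigned descriptors, and a request retrieves the documents whose text contains all of the request words.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_text_directory(texts):
    """Inverted directory over all text words: word -> set of document numbers."""
    directory = defaultdict(set)
    for doc_id, text in texts.items():
        for word in tokenize(text):
            directory[word].add(doc_id)
    return directory

def free_text_search(directory, request):
    """Return documents whose stored text contains every word of the request."""
    words = tokenize(request)
    if not words:
        return []
    result = set(directory.get(words[0], set()))
    for word in words[1:]:
        result &= directory.get(word, set())
    return sorted(result)

# Hypothetical stored text excerpts.
texts = {
    1: "Automatic indexing of document abstracts",
    2: "Manual indexing by subject experts",
    3: "Evaluation of automatic retrieval systems",
}
text_directory = build_text_directory(texts)

print(free_text_search(text_directory, "automatic indexing"))  # [1]
print(free_text_search(text_directory, "indexing"))            # [1, 2]
print(free_text_search(text_directory, "thesaurus"))           # []
```

The search strategy is still an exact match over inverted lists; only the source of the index terms has changed.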
When the work on the Smart retrieval experiments was initiated in the
early 1960s, some attempts had been made at implementing so-called
automatic indexing systems¹⁻⁴. These consisted in using the computer to
scan document texts, or text excerpts such as document abstracts, and in
assigning as content descriptors words that occurred sufficiently frequently
in a given text. The early retrieval experiments conducted with such
automatic indexing products showed that a large number of the automatically
chosen index terms would also have been assigned by manual indexers, and
that the automatic indexing products, contrary to expectation, did not prove
to be totally inadequate.
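A rough sketch of frequency-based descriptor assignment of the kind described above follows. The threshold `min_count`, the small `STOP_WORDS` list, the sample abstract, and the manual term set are all assumptions made for illustration; the early systems cited in the text are not documented here in enough detail to reproduce their exact rules.

```python
import re
from collections import Counter

# Assumed list of common function words to ignore; not taken from the source.
STOP_WORDS = frozenset(
    {"the", "of", "a", "an", "and", "in", "to", "is", "for", "by", "on", "with"}
)

def assign_descriptors(text, min_count=2):
    """Assign as descriptors the words occurring at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return sorted(w for w, c in counts.items() if c >= min_count)

# Hypothetical abstract and hypothetical manually assigned terms.
abstract = (
    "Retrieval experiments compare indexing methods. Automatic indexing "
    "assigns terms to documents, and retrieval tests measure how the terms perform."
)
manual_terms = {"indexing", "retrieval", "evaluation"}

automatic_terms = assign_descriptors(abstract)
shared = sorted(set(automatic_terms) & manual_terms)

print(automatic_terms)  # ['indexing', 'retrieval', 'terms']
print(shared)           # ['indexing', 'retrieval']
```

Comparing the automatic terms with a manual assignment, as in the last two lines, is one simple way to express the overlap between the two kinds of indexing that the early experiments reported.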
Moreover, it appeared that the rudimentary early automatic indexing
products could be easily improved. Thus linguists led the way by pointing out
that a number of linguistic processes were `essential' for the generation of