Information Retrieval Experiment

The Smart environment for retrieval system evaluation - advantages and problem areas
Chapter author: Gerard Salton
Editor: Karen Sparck Jones
Publisher: Butterworth & Company

While the operational retrieval environment has thus drastically changed over the last few years, the intellectual design of the retrieval operations has remained reasonably unchanged for some decades. The following principal characteristics may be noted:

(a) documents are normally indexed manually, that is, subject indicators and content descriptions are manually assigned to the bibliographic items by subject experts and professional indexers;

(b) search statements are manually formulated by users or search intermediaries using one or more acceptable search terms and appropriate boolean connectives between the terms; subsequent reformulations and improvements in the query formulations are also carried out manually;

(c) the principal file search device is an auxiliary, so-called inverted directory which contains for each accepted content descriptor a list of the document references to which that term is assigned; the documents to be retrieved are then identified by comparing and merging the document reference lists corresponding to the various query terms;

(d) an 'exact match' retrieval strategy is carried out by retrieving all items whose content description exactly matches the term combination specified in the search request; normally, all retrieved items are considered by the system as being equally relevant to the user's needs, and no special method is
provided for ranking the output items in presumed order of goodness for the user.

Enhancements are included in many of the modern search systems in the form of 'free text' manipulations allowing the user to choose arbitrary search terms, that is, natural language terms that are not controlled by any dictionary or authority lists, leading to the retrieval of all documents whose stored texts (or text excerpts) contain a particular term combination included in the search requests. But even in the free text search mode, inverted directories are created containing all the text words that could lead to the retrieval of a given document in the collection. Additional refinements in the search mode are available in some modern online environments in the form of dictionary and vocabulary displays leading to better query formulation capabilities. However, the basic manual query formulation and exact match retrieval strategy based on inverted files is maintained in practically all operational retrieval situations.

When the work on the Smart retrieval experiments was initiated in the early 1960s, some attempts had been made at implementing so-called automatic indexing systems [1-4]. These consisted in using the computer to scan document texts, or text excerpts such as document abstracts, and in assigning as content descriptors words that occurred sufficiently frequently in a given text. The early retrieval experiments conducted with such automatic indexing products showed that a large number of the automatically chosen index terms would also have been assigned by manual indexers, and that the automatic indexing products, contrary to expectation, did not prove to be totally inadequate. Moreover, it appeared that the rudimentary early automatic indexing products could be easily improved. Thus linguists led the way by pointing out that a number of linguistic processes were 'essential' for the generation of