Information Retrieval Experiment

The Smart environment for retrieval system evaluation - advantages and problem areas
Chapter author: Gerard Salton. Editor: Karen Sparck Jones. Publisher: Butterworth & Company.

word stems from document titles and abstracts, possibly supplemented by the use of a term classification, or thesaurus, designed to recognize some synonyms and related terms [10-12]. At first, the evaluation results were thought to be indicative of flaws in the system design, and the decision was made to redesign the Smart environment so as to create a more flexible retrieval environment. In time, several other large-scale retrieval tests carried out independently of the Smart environment have, however, confirmed the original Smart results. In particular, the well-known Aslib Cranfield project also found that the simpler indexing methodologies were more effective than the more complex ones, and at the present time, there is an understanding among retrieval experts that an overspecification of document content normally produced by the more refined indexing methodologies can be just as detrimental as an underspecification [13]. This evidence does not, however, prevent many people from still clamouring for more sophisticated linguistic analysis procedures to be incorporated into automatic indexing systems, or indeed from incorporating such methodologies into newly designed retrieval systems [14].

The extended Smart system is briefly described in the next section and the various insights gained from the Smart experimentation are discussed in the remainder of the study.
15.3 The extended Smart system

Since the initial Smart experiments were in a sense `unsuccessful', it seemed reasonable to generalize the basic experimental framework in an attempt to determine just what went wrong with the early tests, and to identify indexing, search and retrieval methods that would actually prove effective. Accordingly, an extended system was developed with the following capabilities:

(a) A large number of automatic indexing procedures were made available, including operations with automatically generated term associations and term hierarchies. Furthermore, the indexing products could be derived by analysing document titles only, titles and abstracts, or full document texts, and the query-document comparisons could be carried out using a variety of similarity measures [15].

(b) So-called relevance feedback capabilities were implemented, making it possible automatically to generate improved query formulations based on relevance assessments submitted by the users in response to previously retrieved documents. A given user-system interaction could then be carried out in several steps using continually improved query formulations until satisfactory output was obtained [16].

(c) Various file organizations could be used, including classified, or clustered, collections in which a partial traversal of the stored records would quickly lead to the retrieval of items in areas of interest to the user population.

Extensions were also considered by applying the automatic procedures to foreign language documents, and by utilizing bibliographic citations as content identifiers [18, 19]. Eventually, the Smart procedures were compared with the conventional inverted file technologies based on manually assigned keywords to the documents of a collection [20, 21]. A full discussion of the retrieval results is beyond the scope of this study.
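
Capabilities (a) and (b) may be made concrete with a small sketch. The text does not say which similarity measure or which feedback formula is meant, so the following assumes a cosine measure over term-weight vectors and a Rocchio-style query update; the function names, the toy documents and the weights alpha, beta and gamma are illustrative assumptions, not details of the Smart implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average of a list of sparse vectors; empty dict for an empty list."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update (assumed here): move the query towards documents
    judged relevant and away from those judged non-relevant.
    Terms whose weight is not positive are dropped."""
    rel, non = centroid(relevant), centroid(nonrelevant)
    new_query = {}
    for t in set(query) | set(rel) | set(non):
        w = (alpha * query.get(t, 0.0)
             + beta * rel.get(t, 0.0)
             - gamma * non.get(t, 0.0))
        if w > 0:
            new_query[t] = w
    return new_query


def rank(query, docs):
    """Document names in decreasing order of similarity to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


# Hypothetical collection indexed by word stems.
docs = {
    "d1": {"retriev": 1.0, "index": 1.0},
    "d2": {"retriev": 1.0, "cluster": 1.0},
    "d3": {"feedback": 1.0, "cluster": 1.0},
}
query = {"retriev": 1.0}
first_pass = rank(query, docs)
# Suppose the user judges d2 relevant and d1 non-relevant.
improved = feedback(query, [docs["d2"]], [docs["d1"]])
second_pass = rank(improved, docs)
```

In this toy case the initial query cannot distinguish d1 from d2; after one round of feedback the reformulated query gains the stem of the relevant item and d2 moves ahead of d1. The step may be repeated with fresh judgments, which corresponds to the multi-step interaction described under (b).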