Information Retrieval Experiment (IRE). Karen Sparck Jones. Butterworth & Company.
Chapter: The pragmatics of information retrieval experimentation. Jean M. Tague.

that will be processed. The norm in computer-based information retrieval is a file of document surrogates in random order, or ordered on some semi-random attribute such as accession number, with one or more associated inverted files giving access by index term, author, title term, abstract term, or other aspect of interest to the investigator. The advantage of the random sequence is that documents can be added to the file without reorganizing it.

Before setting up inverted file indexes to a set of document records, a number of choices must be made:

(1) What attributes or fields will be indexed?
(2) Will the indexing be based on the complete string within a field or on individual words within the field?
(3) Will all individual words be indexed or will there be a stop list?
(4) Will words be stemmed?

These choices will depend on the purpose of the experiment, but the wise investigator will think out all implications before setting up the database. For example, is stemming economically justifiable if a truncation operation can be used in searching?

Some experimental databases, notably the Smart system, use a clustered organization. This structure often increases search efficiency and reduces search time. Of course, there is processing time involved in the original clustering, but if many searches are processed, there may be an overall benefit.
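Returning to the four inverted-file choices listed earlier (which field, whole string or individual words, stop list, stemming), the following is a minimal sketch of how they might appear as parameters of an index-building routine. Everything here is an illustrative assumption rather than anything taken from the chapter: the sample records, the stop list, the names `build_inverted_file` and `truncation_search`, and in particular `crude_stem`, which is a toy suffix stripper and not a published stemming algorithm.

```python
from collections import defaultdict

# Hypothetical document surrogates, keyed by accession number.
RECORDS = {
    1: {"author": "Salton", "title": "Automatic indexing of documents"},
    2: {"author": "Hartigan", "title": "Clustering algorithms"},
    3: {"author": "Tague", "title": "The index and the indexer"},
}

STOP_LIST = {"the", "of", "and", "a", "in"}  # illustrative only


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_file(records, field, by_word=True, stop_list=None, stem=False):
    """Map each index key found in `field` to a sorted list of accession numbers."""
    postings = defaultdict(set)
    for accession, record in records.items():      # choice (1): which field
        value = record[field].lower()
        keys = value.split() if by_word else [value]  # choice (2): words or whole string
        for key in keys:
            if by_word and stop_list and key in stop_list:  # choice (3): stop list
                continue
            if by_word and stem:                            # choice (4): stemming
                key = crude_stem(key)
            postings[key].add(accession)
    return {key: sorted(ids) for key, ids in postings.items()}


def truncation_search(inverted, prefix):
    """Right-truncation search: merge postings of every key starting with `prefix`."""
    hits = set()
    for key, ids in inverted.items():
        if key.startswith(prefix):
            hits.update(ids)
    return sorted(hits)
```

With these toy records, a truncation search for `index` on the unstemmed title index and an exact lookup of `index` in the stemmed title index both retrieve records 1 and 3. The stemmed index is smaller, but it costs extra processing at build time, which is one way to frame the question of whether stemming is economically justifiable when truncation is available.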
More efficient clustering algorithms are constantly being developed, so anyone intending to follow this route should first survey the recent computer science literature. A good survey of clustering algorithms is given in Hartigan¹⁶. A simple single-link algorithm is described in Salton¹⁷.

The medium of the database (whether it is computer-based, microform, or printed) and, if computer-based, whether it will be accessed in batch or online mode, is a decision that will usually be made in the early stages of a project, because of its implications for the resources which will be required. Sometimes the choice will be predetermined by the nature of the experiment or the availability of facilities. Where there is a choice, the investigator should consider the following points:

(1) At the data entry stage, computer-based files are more efficient, as each document record needs to be keyed only once.
(2) Corrections, reformatting for printed output, sequencing for storage, and production of multiple printed records can all be carried out automatically with machine-readable input. Even essentially manual files such as card catalogues are now produced by computer. Word processing equipment is useful in generating small printed files.
(3) In-house computer files require a set of programs for the initial set-up of the database, for maintenance, and for retrieval. (See the next section for further comments on the design of these.)
(4) Online systems offer much greater flexibility in searching and in analysing searches.
(5) The cost differential between online and batch is rapidly changing in favour of online. Online retrieval is rapidly becoming the norm in libraries, businesses, and scientific institutions. It seems inevitable that the information retrieval field