IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
that will be processed. The norm in computer-based information retrieval is
a file of document surrogates in random order or ordered on some semi-
random attribute such as accession number, with one or more associated
inverted files to access by index term, author, title term, abstract term, or
other aspect of interest to the investigator. The advantage of the random
sequence is that documents can be added to the file without reorganizing it.
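
A minimal sketch of this arrangement follows. The class name, record layout, and sample data are illustrative assumptions and are not taken from any particular system. Surrogates are kept in accession order with one inverted file on index terms; adding a document appends to the main file and touches only the postings for that document's own terms.

```python
from collections import defaultdict


class SurrogateFile:
    """Document surrogates in accession order plus one inverted file on index terms."""

    def __init__(self):
        self.records = []                    # main file; list position = accession number
        self.term_index = defaultdict(list)  # inverted file: term -> accession numbers

    def add(self, title, index_terms):
        accession = len(self.records)        # next accession number
        self.records.append({"title": title, "terms": list(index_terms)})
        for term in set(index_terms):
            # Appending keeps each postings list in ascending accession order,
            # so nothing already stored has to be moved or re-sorted.
            self.term_index[term].append(accession)
        return accession

    def search(self, term):
        return [self.records[a]["title"] for a in self.term_index.get(term, [])]


f = SurrogateFile()
f.add("Evaluation of indexing languages", ["indexing", "evaluation"])
f.add("Online search behaviour", ["online", "searching"])
f.add("Cost of online evaluation", ["online", "evaluation"])
print(f.search("evaluation"))
```

Further inverted files (author, title term, and so on) would be additional dictionaries of the same shape, each updated in the same `add` step.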
Before setting up inverted file indexes to a set of document records, a
number of choices must be made:
(1) What attributes or fields will be indexed?
(2) Will the indexing be based on the complete string within a field or on
individual words within the field?
(3) Will all individual words be indexed or will there be a stop list?
(4) Will words be stemmed?
These choices will be dependent on the purpose of the experiment, but the
wise investigator will think out all implications before setting up the
database. For example, is stemming economically justifiable if a truncation
operation can be used in searching?
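
The four choices, and the stemming versus truncation trade-off, can be sketched as follows. Everything here is an assumption made for illustration: the stop list, the toy suffix stripper (which is not a published stemming algorithm), the function names, and the sample records. Each choice is a parameter of `build_index`, while `truncation_search` does the equivalent conflation at search time over an unstemmed index.

```python
import bisect
from collections import defaultdict

STOP_WORDS = {"the", "of", "a", "an", "in", "and", "for"}  # illustrative stop list


def crude_stem(word):
    """Toy suffix stripper for illustration only; not a published stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(records, fields, by_word=True, stop_words=None, stem=False):
    index = defaultdict(set)
    for doc_id, record in enumerate(records):
        for field in fields:                                  # choice (1): which fields
            value = record.get(field, "").lower()
            keys = value.split() if by_word else [value]      # choice (2): words or whole string
            for key in keys:
                if by_word and stop_words and key in stop_words:
                    continue                                  # choice (3): stop list
                if by_word and stem:
                    key = crude_stem(key)                     # choice (4): stemming
                if key:
                    index[(field, key)].add(doc_id)
    return index


def truncation_search(index, field, prefix):
    """Right truncation: match every indexed key in `field` that starts with `prefix`."""
    keys = sorted(k for f, k in index if f == field)
    hits = set()
    for key in keys[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break                                             # matching keys are contiguous
        hits |= index[(field, key)]
    return sorted(hits)


records = [
    {"title": "Indexing of the documents", "author": "Smith, J."},
    {"title": "Index languages for retrieval", "author": "Jones, K."},
    {"title": "Retrieval tests", "author": "Smith, J."},
]
unstemmed = build_index(records, ["title"], stop_words=STOP_WORDS)
stemmed = build_index(records, ["title"], stop_words=STOP_WORDS, stem=True)
authors = build_index(records, ["author"], by_word=False)

print(truncation_search(unstemmed, "title", "index"))  # conflation at search time
print(sorted(stemmed[("title", "index")]))             # conflation at indexing time
```

In this small case both routes retrieve the same two documents, which is the point of the question: stemming pays its cost once when the index is built, while truncation pays a small cost on every search and leaves the index unaltered.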
Some experimental databases, notably the Smart system, use a clustered
organization. This structure often increases search efficiency and reduces
search time. Of course, there is processing time involved in the original
clustering, but if many searches are processed, there may be an overall
benefit. More efficient clustering algorithms are constantly being developed,
so that if one intends to follow this route, a survey of recent computer science
literature would be in order. A good survey of clustering algorithms is given
in Hartigan16. A simple single-link algorithm is described in Salton17.
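
For orientation, single-link clustering at one similarity level can be sketched in a few lines. This is a generic illustration and not necessarily the algorithm described in the cited texts. The Jaccard coefficient on term sets, the threshold value, and the sample documents are all assumptions. Two documents fall in the same cluster whenever a chain of pairwise similarities at or above the threshold connects them.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two term sets: shared terms divided by all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_link_clusters(term_sets, threshold):
    """Single-link clusters at one level: connected components of the graph
    whose edges join documents with similarity >= threshold."""
    parent = list(range(len(term_sets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(term_sets)), 2):
        if jaccard(term_sets[i], term_sets[j]) >= threshold:
            parent[find(i)] = find(j)      # merge the two clusters

    clusters = {}
    for i in range(len(term_sets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


docs = [
    {"retrieval", "evaluation", "recall"},
    {"retrieval", "evaluation", "precision"},
    {"evaluation", "precision", "cost"},
    {"clustering", "algorithm"},
    {"clustering", "algorithm", "hierarchy"},
]
print(single_link_clusters(docs, 0.5))
```

Documents 0 and 2 have a similarity of only 0.2, yet they share a cluster because each is linked to document 1 at 0.5. This chaining is characteristic of the single-link criterion. The sketch compares every pair of documents, so its cost grows with the square of the collection size.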
The medium of the database (whether it is computer-based, microform,
or printed) and, if computer-based, whether it will be accessed in batch or
online mode, is a decision that will usually be made in the early stages of a
project, because of its implications for the resources which will be required.
Sometimes, the choice will be predetermined by the nature of the experiment
or the availability of facilities. Where there is a choice, the investigator
should consider the following points:
(1) At the data entry stage, computer-based files are more efficient, as each
document record needs to be keyed once only.
(2) Corrections, reformatting for printed output, sequencing for storage, and
production of multiple printed records can all be carried out automatically
with machine-readable input. Even essentially manual files such as card
catalogues are now produced by computer. Word processing equipment
is useful in generating small printed files.
(3) In-house computer files require a set of programs for the initial set-up of
the database, for maintenance, and for retrieval. (See the next section for
further comments on the design of these.)
(4) Online systems offer much greater flexibility in searching and in analysing
searches.
(5) The cost differential between online and batch is rapidly changing in
favour of online.
Online retrieval is rapidly becoming the norm in libraries, businesses, and
scientific institutions. It seems inevitable that the information retrieval field