Information Retrieval Experiment
Laboratory tests: automatic systems
A chapter by Robert N. Oddy
Edited by Karen Sparck Jones
Butterworth & Company
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
For instance, the inversion of the document descriptions file contains one
record for each term: the term number followed by a list of document
numbers. An extremely clear account of the organization of test collection
data for experimental information retrieval is to be found in Sparck Jones
and Bates' report6 (p. D1 ff.) on their work at Cambridge University. They
point out the formal similarity between files of different purport. For
instance, a file in what they call 'a b' form consists of a sequence of numbers,
a, each element of which is followed by a list of numbers, b, terminated by a
character '/'. A set of document descriptions can be encoded in the 'a b' form
(see Figure 9.2), as can a set of queries and a set of relevance judgements. A
single program can be used to invert any of these files, because they are in a
standard format, and the same is true of any other process that is required.
Having generated the primary files (document descriptions, queries,
relevance judgements) from raw data, Sparck Jones and Bates went on to
create a standard set of auxiliary files, such as inversions and frequency data,
as a matter of course, for all their test collections.
Figure 9.2. Document descriptions in 'a b' form (each document number is followed by its list of term numbers)
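To make the uniformity of the 'a b' form concrete, the sketch below gives a minimal Python illustration, not Sparck Jones and Bates' actual program, of a parser and an inversion routine for data laid out as in Figure 9.2. The token stream, the function names and the example numbers are my own assumptions; the point is only that one routine serves for document descriptions, queries and relevance judgements alike, because all three share the same format.

```python
from collections import defaultdict

def parse_ab(tokens):
    """Parse a token stream in 'a b' form: each a-number is followed by
    its b-numbers and a terminating '/' marker.  Returns a mapping from
    each a-number to its list of b-numbers."""
    records = {}
    it = iter(tokens)
    for tok in it:
        a = int(tok)              # an a-number, e.g. a document number
        bs = []
        for tok in it:
            if tok == '/':        # end of this record
                break
            bs.append(int(tok))   # a b-number, e.g. a term number
        records[a] = bs
    return records

def invert(records):
    """Invert an 'a b' file: each b-number is mapped to the list of
    a-numbers whose records contain it."""
    inverted = defaultdict(list)
    for a, bs in records.items():
        for b in bs:
            inverted[b].append(a)
    return dict(inverted)

# Two document descriptions: document number, then its term numbers, then '/'.
tokens = "1 5 57 86 / 2 19 57 407 /".split()
documents = parse_ab(tokens)   # {1: [5, 57, 86], 2: [19, 57, 407]}
index = invert(documents)      # {5: [1], 57: [1, 2], 86: [1], 19: [2], 407: [2]}
```

Applied to a query file, the same inversion routine yields, for each term, the list of queries that use it; applied to relevance judgements in 'a b' form, it yields, for each document, the queries to which it is relevant.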
If an experimenter wishes to work from raw data, he must equip himself
with programs to derive the numerical representation of the test collection.
Typically, the textual material is first processed to form a dictionary of terms,
or 'stems', with associated term numbers. Then the texts are scanned again,
their component words matched against the terms in the dictionary and
replaced by the corresponding term numbers. The algorithms employed to
construct the dictionary vary from one experimental system to another,
usually in minor ways, and may include automatic suffix stripping and allow
for the manual inclusion of synonyms. Accounts of the principles underlying
these methods can be found in van Rijsbergen12 and Salton25.
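As a concrete illustration of this two-pass process, the Python sketch below builds a term dictionary from raw texts and then re-scans them, replacing words by term numbers. The stop-word list and the suffix-stripping rule are deliberately crude placeholders of my own, standing in for the more careful procedures described by van Rijsbergen12 and Salton25; no particular experimental system is being reproduced here.

```python
STOP_WORDS = {'the', 'of', 'and', 'a', 'to', 'in'}   # illustrative only

def stem(word):
    """Crude suffix stripping, used here only as a placeholder for a
    real stemming algorithm."""
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_dictionary(texts):
    """First pass: assign a term number to every distinct stem."""
    dictionary = {}
    for text in texts:
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            term = stem(word)
            if term not in dictionary:
                dictionary[term] = len(dictionary) + 1
    return dictionary

def encode(text, dictionary):
    """Second pass: replace each remaining word by its term number."""
    return [dictionary[stem(word)]
            for word in text.lower().split()
            if word not in STOP_WORDS]

texts = ["indexing and retrieval of documents",
         "automatic document retrieval experiments"]
dictionary = build_dictionary(texts)
descriptions = [encode(text, dictionary) for text in texts]
# dictionary:   {'index': 1, 'retrieval': 2, 'document': 3, 'automatic': 4, 'experiment': 5}
# descriptions: [[1, 2, 3], [4, 3, 2, 5]]
```

The resulting lists of term numbers are document descriptions of exactly the kind that can be written out in 'a b' form and inverted as sketched above.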
Note that at the moment I am concerned with the primitive data comprising a test
collection: the more sophisticated indexing structures which have been the
subject of most recent automatic information retrieval experimentation are
derivations or transformations of them. There are a few exceptions. For
example, some retrieval methods require a syntactic analysis of the document
and query texts26, 27. There has been very little evidence of this type of work
in information retrieval laboratories for several years now, following
generally disappointing results28, 29. Other approaches make use of the
positions of words in the text, so a concordance must be generated from the