Information Retrieval Experiment (ed. Karen Sparck Jones). Chapter: Laboratory tests: automatic systems, by Robert N. Oddy.

For instance, the inversion of the document descriptions file contains one record for each term: the term number followed by a list of document numbers. An extremely clear account of the organization of test collection data for experimental information retrieval is to be found in Sparck Jones and Bates' report6 (p. D1 ff) on their work at Cambridge University. They point out the formal similarity between files of different purport. For instance, a file in what they call `a b' form consists of a sequence of numbers, a, each element of which is followed by a list of numbers, b, terminated by the character '/'. A set of document descriptions can be encoded in the `a b' form (see Figure 9.2), as can a set of queries and a set of relevance judgements. A single program can be used to invert any of these files, because they are in a standard format, and the same is true of any other process that is required (a sketch of such an inversion is given at the end of this section). Having generated the primary files (document descriptions, queries, relevance judgements) from raw data, Sparck Jones and Bates went on to create a standard set of auxiliary files, such as inversions and frequency data, as a matter of course, for all their test collections.

[Figure 9.2. Document descriptions in `a b' form: document numbers, each followed by a list of term numbers]

If an experimenter wishes to work from raw data, he must equip himself with programs to derive the numerical representation of the test collection. Typically, the textual material is first processed to form a dictionary of terms, or `stems', with associated term numbers. Then the texts are scanned again, their component words matched against the terms in the dictionary and replaced by the corresponding term numbers (see the second sketch at the end of this section). The algorithms employed to construct the dictionary vary from one experimental system to another, usually in minor ways, and may include automatic suffix stripping and allow for the manual inclusion of synonyms. Accounts of the principles underlying these methods can be found in van Rijsbergen12 and Salton25. Note that at the moment I am concerned with the primitive data comprising a test collection: the more sophisticated indexing structures which have been the subject of most recent automatic information retrieval experimentation are derivations or transformations of them. There are a few exceptions. For example, some retrieval methods require a syntactic analysis of the document and query texts26, 27. There has been very little evidence of this type of work in information retrieval laboratories for several years now, following generally disappointing results28, 29. Other approaches make use of the positions of words in the text, so a concordance must be generated from the texts.
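The following is a minimal sketch of the inversion discussed above, assuming only what the text states: that an `a b' form file is a sequence of numbers a, each followed by a list of numbers b terminated by the character '/'. The whitespace layout, file names and function names are illustrative assumptions, not the actual Cambridge programs.

```python
# Sketch of reading, inverting and writing files in `a b' form, assuming the
# tokens are whitespace-separated and each list b ends with the character '/'.
# The layout of the real test collection files may differ.
from collections import defaultdict


def read_ab_file(path):
    """Yield (a, [b, ...]) pairs from a file in `a b' form."""
    with open(path) as f:
        tokens = f.read().split()
    it = iter(tokens)
    for tok in it:
        a = int(tok)
        bs = []
        for t in it:
            if t == '/':
                break
            bs.append(int(t))
        yield a, bs


def invert(records):
    """Invert (a, [b]) records into (b, [a]) records, e.g. turn document
    descriptions into a term-to-document (inverted) file."""
    inverted = defaultdict(list)
    for a, bs in records:
        for b in bs:
            inverted[b].append(a)
    return sorted(inverted.items())


def write_ab_file(path, records):
    """Write records back out in the same `a b' form."""
    with open(path, 'w') as f:
        for a, bs in records:
            f.write(' '.join(str(x) for x in [a, *bs]) + ' /\n')


# Because document descriptions, queries and relevance judgements share this
# format, the same program inverts any of them (hypothetical file names):
# write_ab_file('terms_to_docs.ab', invert(read_ab_file('docs_to_terms.ab')))
```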
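The second sketch illustrates the two-pass conversion of raw text into numerical document descriptions: a first pass builds a dictionary of stems with term numbers, and a second pass replaces each word by its term number. The trivial suffix stripper and the synonym table here are illustrative stand-ins for the fuller facilities mentioned in the text; they are not the algorithms of any particular experimental system.

```python
# Two-pass encoding of texts as term numbers: build a dictionary of stems,
# then replace words by their term numbers. Suffix list and synonym table
# are deliberately tiny, purely illustrative assumptions.
import re

SUFFIXES = ('ing', 'ed', 's')          # assumed, trivially small suffix list
SYNONYMS = {'information': 'data'}     # assumed, manually supplied synonyms


def stem(word):
    """Normalize a word: lower-case, map synonyms, strip a known suffix."""
    word = SYNONYMS.get(word.lower(), word.lower())
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def tokenize(text):
    return re.findall(r"[A-Za-z]+", text)


def build_dictionary(texts):
    """First pass: assign a term number to every distinct stem."""
    dictionary = {}
    for text in texts:
        for word in tokenize(text):
            s = stem(word)
            if s not in dictionary:
                dictionary[s] = len(dictionary) + 1
    return dictionary


def encode(texts, dictionary):
    """Second pass: one numerical document description per text."""
    return [sorted({dictionary[stem(w)] for w in tokenize(t)}) for t in texts]


texts = ["Retrieval systems store document descriptions",
         "Automatic indexing of documents"]
d = build_dictionary(texts)
print(encode(texts, d))   # [[1, 2, 3, 4, 5], [4, 6, 7, 8]]
```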