The full text of the documents may be available in machine-readable form. More frequently, it will be the abstracts that are available, or perhaps only the titles. Index terms may have been assigned to the documents manually, using one or more indexing languages. Queries are often expressed as short natural language questions or statements, and may be accompanied by terms chosen manually from an indexing language. The constitution of the raw data will, of course, result from compromises between the experimenter's objectives and the data available. Each query must have associated with it a list of documents judged relevant: the judgements are usually recorded as points on a simple ordinal scale of relevance. Before retrieval tests are conducted, the collection is typically transformed into a convenient numerical form.

Some test collections are used by a number of researchers working in different institutions. The collection constructed by Cleverdon in the mid-1960s at Cranfield[14] is one that has seen, perhaps, the heaviest use in computer-based tests, although it was built for a manually executed experiment. An experimenter who chooses to make use of an existing test collection may use the raw data, if he wishes to try a new text processing technique, or he may be able to take advantage of existing indexing as embodied in the numerically coded data.

All but a small minority of experiments have assumed a very simple structure for document descriptions and queries; the same structure serves for both. A document (or query) is represented by a set of weighted terms, for example:

cluster (6), file (4), method (4), document (3), single-link (3), hierarchy (3), algorithm (2), compare (2), time (2), heuristic (1), theory (1), ...

The weights are derived from the text and usually indicate the importance, in some sense, of the term to the subject matter of the document (query). The terms can be numbered serially (in an arbitrary order), and if the collection is static, as it invariably is in laboratory tests, the representative can be described as a vector whose elements are term weights. This is the symbolism employed in most of the literature on the SMART experiments[8]. In many tests, a special instance of this representational structure is used, namely a set of unweighted terms: if the weights are all the same, they can be dropped from the descriptions. The more complex structures which have appeared in experimental work, such as term classes[2] and document clusters[3, 8, 24], are derived from simple representatives like these.

In a laboratory situation, it is very unusual for the automatic system to operate in the presence of human enquirers. Consequently, it is not necessary to provide facilities for interpreting and displaying textual data in the search programs. Retrieval and information structuring programs can be simplified considerably if they do not have to handle text. Therefore, documents, queries and terms are given numerical names (serial numbers), as the sketch below illustrates.
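To make the weighted-term representation and the numerical naming concrete, the following is a minimal sketch in Python. The terms, weights and serial numbers are invented for illustration, and the inner-product matching function at the end is not prescribed by this chapter, though it is typical of SMART-style experiments:

    # A term dictionary assigns each term an arbitrary serial number.
    # (All names and numbers here are hypothetical.)
    term_numbers = {
        "cluster": 1, "file": 2, "method": 3, "document": 4,
        "single-link": 5, "hierarchy": 6, "algorithm": 7,
    }

    # A document (or query) representative: a set of weighted terms ...
    doc_text_form = {"cluster": 6, "file": 4, "method": 4, "document": 3}

    # ... reduced to purely numerical form: {term number: weight}.
    doc = {term_numbers[t]: w for t, w in doc_text_form.items()}
    query = {term_numbers["cluster"]: 2, term_numbers["method"]: 1}

    # One simple matching function over such vectors is the inner product:
    # sum the products of the weights of terms common to both.
    score = sum(w * query[t] for t, w in doc.items() if t in query)
    print(score)  # 6*2 + 4*1 = 16

Once text has been reduced to this form, the retrieval programs never need to see a character string, which is precisely the simplification the paragraph above describes.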
To the programs, a document description is a document number followed by a list of term numbers, each of which may, in some collections, be accompanied by a numerical weight. A query has the same structure. The relevance judgements pertaining to a query are typically denoted by a query number followed by a list of the serial numbers of documents deemed relevant. From files of records of this sort, other files can be derived by relatively straightforward programs, for the later convenience of retrieval programs.
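As a sketch of how such numeric records might be handled, and of the kind of derived file mentioned above, the following assumes a simple whitespace-separated layout ("doc_no term_no:weight ..."); the exact layout varies between collections and is an assumption here. It parses document records and builds an inverted file mapping each term number to the documents containing it:

    from collections import defaultdict

    # Hypothetical record layout (an assumption; real collections differ):
    # one document per line, "doc_no term_no:weight term_no:weight ...".
    raw_records = [
        "101 1:6 2:4 3:4 4:3",
        "102 2:2 4:5 7:1",
    ]

    def parse_record(line):
        """Parse one document record into (doc_no, {term_no: weight})."""
        fields = line.split()
        doc_no = int(fields[0])
        terms = {}
        for field in fields[1:]:
            term_no, weight = field.split(":")
            terms[int(term_no)] = int(weight)
        return doc_no, terms

    documents = dict(parse_record(line) for line in raw_records)

    # Derive an inverted file: term number -> list of (doc_no, weight).
    # This is the sort of file "derived by relatively straightforward
    # programs" for the later convenience of retrieval runs.
    inverted = defaultdict(list)
    for doc_no, terms in documents.items():
        for term_no, weight in terms.items():
            inverted[term_no].append((doc_no, weight))

    print(inverted[2])  # [(101, 4), (102, 2)]

A query file and a relevance-judgement file can be processed in exactly the same way, since each is again a serial number followed by a list of serial numbers.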