IRE Information Retrieval Experiment The pragmatics of information retrieval experimentation chapter Jean M. Tague Butterworth & Company Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Search statements can be categorized by type. Some types are:

    single search elements
    boolean combinations of search elements
    binary vectors of search elements
    weighted vectors of search elements
    any of the above with syntactic requirements:
        roles
        facets
        word adjacency
        word dependency
        co-occurrence frequency

The notions of specificity and exhaustivity can also be applied to search statements, but here the development of operational measures must take into account that search statements are often boolean expressions, i.e. combinations of search elements joined by the connectives and, or, and not (called conjunction, disjunction, and negation). Measures of specificity and exhaustivity generally require that an expression be put into a standard form, for example into disjunctive normal form, as a disjunction of conjuncts. For example, the following is in disjunctive normal form

$$(T_1 \wedge T_2 \wedge T_3) \vee (T_1 \wedge T_2 \wedge T_4) \vee (T_4 \wedge \neg T_5)$$

but the following equivalent statement is not

$$T_1 \wedge T_2 \wedge (T_3 \vee T_4) \vee \neg(\neg T_4 \vee T_5)$$

A measure of the breadth or exhaustivity of the search is the number of conjuncts joined by or's ($\vee$); of the specificity or depth, the average number of terms joined by and's ($\wedge$) per conjunct. In the example above, the breadth is 3 and the depth is 2.67 (eight terms spread over three conjuncts). A problem arises in applying these measures when the system permits truncation of search terms, as do most of the commercial online systems.
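As a rough illustration of the two measures, the sketch below computes breadth and depth for the disjunctive normal form example. It assumes, as the worked figures of 3 and 2.67 suggest, that breadth counts the conjuncts and depth averages the number of terms in each conjunct. The list-of-lists representation, the function names, and the small dictionary used for truncation are all hypothetical choices made for this example, not part of the original text.

```python
from fractions import Fraction

# A search statement in disjunctive normal form: a list of conjuncts,
# each conjunct a list of (term, negated) pairs.
# This encodes (T1 and T2 and T3) or (T1 and T2 and T4) or (T4 and not T5).
DNF = [
    [("T1", False), ("T2", False), ("T3", False)],
    [("T1", False), ("T2", False), ("T4", False)],
    [("T4", False), ("T5", True)],
]


def breadth(dnf):
    """Breadth (exhaustivity): the number of conjuncts joined by OR."""
    return len(dnf)


def depth(dnf):
    """Depth (specificity): the average number of terms per conjunct."""
    return Fraction(sum(len(conjunct) for conjunct in dnf), len(dnf))


def truncation_count(stem, dictionary):
    """Number of dictionary terms matched by a truncated stem.

    Assumes right-hand truncation and a plain list of terms as the
    dictionary; both are illustrative assumptions.
    """
    return sum(1 for term in dictionary if term.startswith(stem))


print(breadth(DNF), round(float(depth(DNF)), 2))  # prints: 3 2.67

# Hypothetical dictionary, invented for the example.
terms = ["computer", "computing", "compute", "retrieval"]
print(truncation_count("comput", terms))  # prints: 3
```

The first two functions need only the search statement itself. The last helper applies only when the statement contains truncated terms, since the count then depends on which discrete terms the truncation stands for.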
Where truncation is permitted, one must refer to the dictionary, if it exists, to determine how many discrete terms correspond to a given truncation. With no dictionary, it may be possible to determine this number by a search of the database.

Other measures relate a document representation and a search statement. The simplest of these is co-ordination level or degree of match, counting the number of terms the document and query have in common. The measure may be normalized in a manner similar to the normalization of inter-indexer consistency, i.e.

$$\frac{N(q \wedge d)}{N(q \vee d)} \quad \text{or} \quad \frac{N(q \wedge d)}{\sqrt{N(q)\,N(d)}}$$

where the search statement $q$ has $N(q)$ terms, the document $d$ has $N(d)$ terms, $N(q \wedge d)$ terms match, and $N(q \vee d)$ terms occur in at least one of the two. Salton's cosine coefficient, which has the same form as the document-document cosine coefficient previously defined, assumes a vector representation of both document and query and is a more general measure of document-query similarity.

The action a searcher takes in response to a query is, in most general terms, a sequence of search statements. It is only in searching computer databases in batch mode that the single search statement is the norm. In searching A