IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Search statements can be categorized by type. Some types are:
single search elements
boolean combinations of search elements
binary vectors of search elements
weighted vectors of search elements
any of the above with syntactic requirements:
roles
facets
word adjacency
word dependency
co-occurrence frequency
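The statement types listed above can be made concrete with a small sketch. The representations below (nested tuples for boolean combinations, lists over a fixed vocabulary for vectors) and the terms themselves are illustrative assumptions, not a scheme prescribed by the text.

```python
# Hypothetical vocabulary and terms, chosen only for illustration.
vocabulary = ["evaluation", "experiment", "indexing", "retrieval"]

# Single search element.
single = "retrieval"

# Boolean combination as a nested tuple: (operator, operand, ...).
boolean = ("and", "retrieval", ("or", "evaluation", "experiment"))

# Binary vector: 1 if the vocabulary term is in the statement, else 0.
binary_vector = [1, 1, 0, 1]

# Weighted vector: one weight per vocabulary term.
weighted_vector = [0.5, 0.2, 0.0, 1.0]


def matches(expr, doc_terms):
    """Evaluate a nested boolean search statement against a set of terms."""
    if isinstance(expr, str):
        return expr in doc_terms
    op, *args = expr
    if op == "and":
        return all(matches(a, doc_terms) for a in args)
    if op == "or":
        return any(matches(a, doc_terms) for a in args)
    if op == "not":
        return not matches(args[0], doc_terms)
    raise ValueError(f"unknown operator: {op}")


def adjacent(first, second, doc_words):
    """Word adjacency: `first` is immediately followed by `second`."""
    return any(a == first and b == second
               for a, b in zip(doc_words, doc_words[1:]))


doc_words = ["information", "retrieval", "experiment"]
print(matches(boolean, set(doc_words)))
print(adjacent("information", "retrieval", doc_words))
```

A syntactic requirement such as word adjacency needs the ordered text of the document, whereas the boolean and vector forms need only the set of terms.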
The notions of specificity and exhaustivity can also be applied to search
statements, but here the development of operational measures must take into
account that search statements are often boolean expressions, i.e. combinations
joined by the connectives and, or, not (called conjunction, disjunction,
and negation). Measures of specificity and exhaustivity generally require that
an expression be put into a standard form, for example into disjunctive
normal form, as a disjunction of conjuncts. For example, the following is in
disjunctive normal form
$$(T_1 \land T_2 \land T_3) \lor (T_1 \land T_2 \land T_4) \lor (T_4 \land \lnot T_5)$$
but the following equivalent statement is not
$$T_1 \land T_2 \land (T_3 \lor T_4) \lor \lnot(\lnot T_4 \lor T_5)$$
A measure of the breadth or exhaustivity of the search is the number of
conjuncts joined by or's ($\lor$); of the depth or specificity, the average
number of terms joined by and's ($\land$) per conjunct. In the example above,
the breadth is 3 and the depth is 8/3, or about 2.67.
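The two measures can be checked with a short sketch. It assumes that breadth is read as the number of conjuncts and depth as the mean number of terms per conjunct, the reading that reproduces the figures 3 and 2.67. The list-of-literals encoding and the function names are illustrative assumptions.

```python
from itertools import product

# Disjunctive normal form as a list of conjuncts; each conjunct is a list
# of (term, negated) literals. This encodes the example from the text.
dnf = [
    [("T1", False), ("T2", False), ("T3", False)],
    [("T1", False), ("T2", False), ("T4", False)],
    [("T4", False), ("T5", True)],
]


def breadth(statement):
    """Number of conjuncts joined by or's (assumed reading of breadth)."""
    return len(statement)


def depth(statement):
    """Mean number of terms per conjunct (assumed reading of depth)."""
    return sum(len(conjunct) for conjunct in statement) / len(statement)


def eval_dnf(statement, values):
    """A literal holds when the term's truth value differs from `negated`."""
    return any(all(values[term] != negated for term, negated in conjunct)
               for conjunct in statement)


def non_normal_form(v):
    """The equivalent statement from the text that is not in normal form."""
    return ((v["T1"] and v["T2"] and (v["T3"] or v["T4"]))
            or not (not v["T4"] or v["T5"]))


def equivalent(statement, func, terms):
    """Compare the two statements over every truth assignment."""
    for bits in product([False, True], repeat=len(terms)):
        values = dict(zip(terms, bits))
        if eval_dnf(statement, values) != func(values):
            return False
    return True


terms = ["T1", "T2", "T3", "T4", "T5"]
print(breadth(dnf), round(depth(dnf), 2))
print(equivalent(dnf, non_normal_form, terms))
```

The truth-table comparison is exhaustive over all 32 assignments of the five terms, which is practical only for short statements.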
A problem arises in applying these measures when the system permits
truncation of search terms, as do most of the commercial online systems. In
this case, one must refer to the dictionary, if it exists, to determine how many
discrete terms correspond to a given truncation. With no dictionary, it may
be possible to determine this number by a search of the database.
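Counting the terms behind a truncation can be sketched as a prefix lookup in a sorted dictionary. The sketch assumes right-hand truncation only; the dictionary contents and the function name are hypothetical.

```python
from bisect import bisect_left


def expand_truncation(stem, dictionary):
    """Return the dictionary terms that begin with a right-truncated stem."""
    terms = sorted(dictionary)
    start = bisect_left(terms, stem)  # first term not less than the stem
    expanded = []
    for term in terms[start:]:
        if not term.startswith(stem):
            break  # sorted order: no later term can share the prefix
        expanded.append(term)
    return expanded


# Hypothetical system dictionary, for illustration only.
dictionary = ["retrieve", "computing", "computer", "retrieval",
              "compute", "computers"]
expanded = expand_truncation("comput", dictionary)
print(len(expanded), expanded)
```

Here the truncation "comput" stands for four discrete terms, so a measure that counts terms would treat it as four alternatives rather than one.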
Other measures relate a document representation and a search statement.
The simplest of these is co-ordination level or degree of match, counting the
number of terms the document and query have in common. The measure
may be normalized in a manner similar to the normalization of inter-indexer
consistency, i.e.
$$\frac{N(q \land d)}{\sqrt{N(q)\,N(d)}} \quad \text{or} \quad \frac{N(q \land d)}{N(q \lor d)}$$
where the search statement q has N(q) terms, the document d has N(d) terms,
$N(q \land d)$ terms match, and $N(q \lor d)$ distinct terms occur in either.
Salton's cosine coefficient, which has the same form as the
document-document cosine coefficient previously defined, assumes a
vector representation of both document and query and is a more general
measure of document-query similarity.
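A minimal sketch of these matching measures follows. It assumes the two normalizations are the overlap divided by the geometric mean of the two sizes and the overlap divided by the size of the union, and that query and document are non-empty. The term sets, weights, and function names are illustrative assumptions.

```python
import math


def coordination_level(query, document):
    """Number of terms the query and document have in common."""
    return len(set(query) & set(document))


def binary_cosine(query, document):
    """Overlap normalized by the geometric mean of the two sizes."""
    q, d = set(query), set(document)
    return len(q & d) / math.sqrt(len(q) * len(d))


def overlap_over_union(query, document):
    """Overlap normalized by the number of distinct terms in either."""
    q, d = set(query), set(document)
    return len(q & d) / len(q | d)


def cosine(query_weights, document_weights):
    """Cosine coefficient for weighted vectors stored as term -> weight."""
    dot = sum(w * document_weights.get(t, 0.0)
              for t, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in document_weights.values()))
    return dot / (norm_q * norm_d)


# Hypothetical query and document representations.
query = {"information", "retrieval", "evaluation"}
document = {"information", "retrieval", "experiment", "design"}

print(coordination_level(query, document))
print(round(binary_cosine(query, document), 3))
print(overlap_over_union(query, document))

# With every weight set to 1, the cosine coefficient reduces to the
# binary form above, which is the sense in which it is more general.
unit_q = {t: 1.0 for t in query}
unit_d = {t: 1.0 for t in document}
print(round(cosine(unit_q, unit_d), 3))
```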
The action a searcher takes in response to a query is, in most general terms,
a sequence of search statements. It is only in searching computer databases
in batch mode that the single search statement is the norm. In searching