IRE
Information Retrieval Experiment
The methodology of information retrieval experiment (chapter)
Stephen E. Robertson
Edited by Karen Sparck Jones
Butterworth & Company
be typical of all the relevant documents missed by A. But we can go some way
towards minimizing this bias by making systems A and B as different as
possible, and/or by using many different systems (B, C, . . .) to help find other
relevant documents missed by A. We can also ask the requester, before
putting the request to the system, for any relevant documents that he/she
already knows about.
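As a concrete illustration of this pooling idea, the following sketch (in Python, using invented names and an assumed search interface that do not come from the chapter) merges the relevant documents found by several deliberately different systems with any documents the requester already knew about:

```python
def known_relevant(query, systems, requester_known, assess):
    """Approximate the set of relevant documents for one query.

    systems         -- objects offering a .search(query) method (assumed interface)
    requester_known -- documents the requester supplied before the search
    assess          -- callable returning True if a document is judged relevant
    """
    pool = set(requester_known)
    for system in systems:
        # Each additional, dissimilar system can contribute relevant
        # documents that system A alone would have missed.
        pool.update(system.search(query))
    # Only documents actually judged relevant form the recall base.
    return {doc for doc in pool if assess(doc)}
```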
All such methods have limitations, and unfortunately it is not known, in
general, how good they are. It seems likely, for instance, that there are some
relevant documents that are never retrieved, and presumably have particular
characteristics that are not detected by such methods. However, no practical
alternatives exist.
Obtaining relevance assessments
The very vague definition of relevance given above (how closely the document
matches the user's need) is certainly not sufficient as a basis for an experiment.
What aspects of the process of obtaining relevance assessments do we need
to consider in more detail?
The first question is: Who is to make the assessments? In the ideal case,
where the request is stimulated by a genuine information need, clearly the
requester should be the one to decide on relevance. This may cause problems,
since the requester may not be prepared to assess as many documents as
desired (for good experimental reasons) by the experimenter. In the past,
many experiments have relied on third parties, particularly when assessments
are required of documents not retrieved. The third party may act as substitute
or as pre-selector for the requester him/herself in the matter of relevance.
This practice is regarded with increasing distrust, though it is hard in some
cases to see any alternative. How reliable it is is not known.
Next, how much of the document should the relevance judge see before
making a judgement? Again, the ideal is clearly the entire text of the
document; but again, this is usually out of the question: usually titles or
abstracts are used. There has been some work on the prediction of relevance
(of full texts) on the basis of titles or abstracts, and it tends to show that titles
alone are very bad indicators, while abstracts are better but still leave a lot to be
desired. It might be reasonable to postulate that for some tests, such
discrepancies will not matter too much, as they will affect all the systems
being compared equally. But it remains just that: a reasonable postulate.
The question of which documents should be judged has in effect been
discussed above. One would often like the whole collection assessed, but this
will usually be impossible. More likely, the judged set for each query will
consist of the pooled output of various searches on different systems,
including perhaps systems other than those under test, or possibly a sample
from such a pool.
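A minimal sketch of how such a judged set might be formed, assuming each run is simply a ranked list of document identifiers, and with the pool depth and sample size invented purely for illustration:

```python
import random

def judged_set(runs, pool_depth=100, sample_size=None, seed=0):
    """Form the set of documents to be judged for one query.

    runs        -- ranked lists of document ids, one per contributing system
    pool_depth  -- how many top-ranked documents each run contributes (illustrative)
    sample_size -- if given, judge only a random sample of the pool
    """
    pool = set()
    for run in runs:
        # Pool the output of the various searches.
        pool.update(run[:pool_depth])
    if sample_size is not None and sample_size < len(pool):
        rng = random.Random(seed)
        # Possibly judge only a sample drawn from the pool.
        return set(rng.sample(sorted(pool), sample_size))
    return pool
```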
The order in which the documents are presented to the judge may be
important. In some sense it is obviously an over-simplification to regard
relevance as something which can be judged for each document independently
of the others: one might more reasonably expect the judgement on any one
document to be affected by which documents the judge has already seen.
Ideally, one would try to devise an evaluation method which took this into
account; in practice, no such method has yet been used. In these