be typical of all the relevant documents missed by A. But we can go some way towards minimizing this bias by making systems A and B as different as possible, and/or by using many different systems (B, C, ...) to help find other relevant documents missed by A. We can also ask the requester, before putting the request to the system, for any relevant documents that he/she already knows about. All such methods have limitations, and unfortunately it is not known, in general, how good they are. It seems likely, for instance, that there are some relevant documents that are never retrieved, and presumably have particular characteristics that are not detected by such methods. However, no practical alternatives exist.

Obtaining relevance assessments

The very vague definition of relevance given above (how closely the document matches the user's need) is certainly not sufficient as a basis for an experiment. What aspects of the process of obtaining relevance assessments do we need to consider in more detail?

The first question is: Who is to make the assessments? In the ideal case, where the request is stimulated by a genuine information need, clearly the requester should be the one to decide on relevance. This may cause problems, since the requester may not be prepared to assess as many documents as desired (for good experimental reasons) by the experimenter. In the past, many experiments have relied on third parties, particularly when assessments are required of documents not retrieved. The third party may act as substitute or as pre-selector for the requester him/herself in the matter of relevance. This practice is regarded with increasing distrust, though it is hard in some cases to see any alternative. How reliable it is is not known.

Next, how much of the document should the relevance judge see before making a judgement? Again, the ideal is clearly the entire text of the document; but again, this is usually out of the question: usually titles or abstracts are used. There has been some work on the prediction of relevance (of full texts) on the basis of titles or abstracts, and it tends to show that titles alone are very bad indicators; abstracts are better but still leave a lot to be desired. It might be reasonable to postulate that for some tests, such discrepancies will not matter too much, as they will affect all the systems being compared equally. But it remains just that: a reasonable postulate.

The question of which documents should be judged has in effect been discussed above. One would often like the whole collection assessed, but this will usually be impossible. More likely, the judged set for each query will consist of the pooled output of various searches on different systems, including perhaps systems other than those under test, or possibly a sample from such a pool.
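To make the pooling procedure concrete, the following is a minimal sketch (not from the original text) of how a judged set might be assembled for one query: the top-ranked output of several systems is merged with any relevant documents the requester already knew about, duplicates are removed, and the result is the set put before the relevance judge. The names used here (pool_for_query, pool_depth, the document identifiers) are illustrative assumptions, not part of any particular test collection.

```python
def pool_for_query(ranked_runs, known_relevant, pool_depth=100):
    """Assemble the judged set (the 'pool') for a single query.

    ranked_runs    -- list of ranked document-id lists, one per system (A, B, C, ...)
    known_relevant -- document ids the requester already knew to be relevant
    pool_depth     -- how many top-ranked documents to take from each system
    """
    pool = set(known_relevant)            # start from the requester's known items
    for run in ranked_runs:
        pool.update(run[:pool_depth])     # add each system's top-ranked documents
    return pool


# Example: three systems searched against the same query.
run_a = ["d12", "d7", "d3", "d44"]
run_b = ["d7", "d99", "d12", "d5"]
run_c = ["d101", "d3", "d2", "d7"]

judged_set = pool_for_query([run_a, run_b, run_c],
                            known_relevant={"d200"},
                            pool_depth=3)
print(sorted(judged_set))   # documents to be put before the relevance judge
```

If the resulting pool is still too large to judge in full, a random sample of it may be assessed instead, as noted above.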
The order in which the documents are presented to the judge may be important. In some sense it is obviously an over-simplification to regard relevance as something which can be judged for each document independently of the others: one might more reasonably expect the judgement on any one document to be affected by which documents the judge has already seen. Ideally, one would try to devise an evaluation method which took this into account; in practice, no such method has yet been used. In these