that it ensured that the relevance judgements were completely independent
of the search strategy and the position of the document in any ranked output.
The only question concerned the number of items from the different search
strategy outputs that should be merged in the first place.
The profiles in INSPEC's commercial SDI service, operating on subject
interests similar to those of our experimental user group and with the same
document collections, were at that time producing an average of 12-15
notifications per profile per week. With this figure as a guide it was decided
that, for merging, the full output from the (optimum) boolean strategies
should be taken with at least the top 25 items from each of the ranked-output
strategies. Allowing for duplicates it was anticipated that the merged output
would comprise at least 50 notifications per user per run. In those subject
areas known to be more productive the full boolean output and the top 30, or
even 40, items from the ranked-output strategies were merged. In fact, over
the total of 8 runs the average weekly number of notifications sent to each
member of the user group for assessment was 59.
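The merging step can be pictured as a simple union that keeps the boolean output in full and truncates each ranked output at the chosen cutoff. The following minimal Python sketch assumes each strategy's output is a list of document numbers; all names are illustrative and not part of the original system:

    # A minimal sketch of the merging step; identifiers are illustrative.

    def merge_outputs(boolean_hits, ranked_outputs, cutoff=25):
        """Combine the full boolean output with the top `cutoff` items
        from each ranked-output strategy, dropping duplicates while
        preserving first-seen order."""
        merged = []
        seen = set()
        # The full output of the (optimum) boolean strategy is always taken.
        for doc_id in boolean_hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
        # Each ranked strategy contributes only its top `cutoff` items.
        for ranking in ranked_outputs:
            for doc_id in ranking[:cutoff]:
                if doc_id not in seen:
                    seen.add(doc_id)
                    merged.append(doc_id)
        return merged

Raising `cutoff` from 25 to 30 or 40 reproduces the adjustment described above for the more productive subject areas.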
Figure 14.2 illustrates in broad outline the operation to the point where the
'single set of notifications without duplicates' has been produced for
despatching to the user for relevance assessments. The actual format of the
notifications (6 in x 4 in cards) followed that used in the commercial INSPEC
SDI service. They included the main bibliographic information (title, author,
affiliation, source reference) plus all the free indexing terms and the main-
entry classification codes. The user also received a summary card of the hit
document numbers on which he indicated the relevance of each document
notified.
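The content of a notification card, and the summary card returned by the user, can be pictured as a simple record. The field names in this Python sketch are assumptions made for illustration, not INSPEC's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class Notification:
        doc_number: int        # hit document number
        title: str
        authors: str
        affiliation: str
        source_reference: str
        free_index_terms: list = field(default_factory=list)
        classification_codes: list = field(default_factory=list)  # main-entry codes

    def summary_card(notifications):
        """Build the summary card: hit document numbers awaiting a
        relevance code from the user (None until assessed)."""
        return {n.doc_number: None for n in notifications}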
In making his relevance assessments the user was asked to apply a three-
category relevance code [1] as follows:
1-highly relevant documents;
2-partly relevant documents;
X-non-relevant documents.
To avoid misleading value judgements, the user was also requested to base
his assessment purely on the subject matter and to ignore such things as the
language of the original document, the quality of the journal in which it
appeared, etc.
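Recording the assessments against the summary card might then look like the following sketch; the code table mirrors the three categories above, and everything else is illustrative:

    # The three-category relevance code applied by the user.
    RELEVANCE_CODES = {
        '1': 'highly relevant',
        '2': 'partly relevant',
        'X': 'non-relevant',
    }

    def record_assessment(summary, doc_number, code):
        """Store the user's relevance code for one notified document."""
        if code not in RELEVANCE_CODES:
            raise ValueError('unknown relevance code: %r' % (code,))
        if doc_number not in summary:
            raise KeyError('document %r was not on the summary card' % (doc_number,))
        summary[doc_number] = code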
This three-category code was deliberately chosen for its relative ease of use
by the user. Highly relevant and completely non-relevant items are in general
quite quickly assessed, with relevance category 2 providing a useful 'dump'
for the difficult or doubtful documents, e.g. those which the user is quite
pleased to see but would not have been concerned had they not been retrieved.
Other relevance categories have of course been proposed and used in
document retrieval experiments. For example in evaluating operational
systems it is useful to distinguish relevant documents which the user has
already seen before being notified via the system from those which are new to
him. As a generalization it might be said that too many categories are not
advisable, with 3 or 4 probably being the optimum number.
A more fundamental issue than relevance categories is the whole question
of relevance. Its nebulous nature has been emphasized increasingly over
recent years, even to the extent of raising it to the realm of philosophical
discourse. Nearly ten years ago Cooper [10] emphasized the distinction between