that it ensured that the relevance judgements were completely independent
of the search strategy and the position of the document in any ranked output.
The only question concerned the number of items from the different search
strategy outputs that should be merged in the first place.
The profiles in INSPEC's commercial SDI service, operating on subject
interests similar to those of our experimental user group and with the same
document collections, were at that time producing an average of 12-15
notifications per profile per week. With this figure as a guide it was decided
that, for merging, the full output from the (optimum) boolean strategies
should be taken with at least the top 25 items from each of the ranked-output
strategies. Allowing for duplicates it was anticipated that the merged output
would comprise at least 50 notifications per user per run. In those subject
areas known to be more productive the full boolean output and the top 30, or
even 40, items from the ranked-output strategies were merged. In fact, over
the total of 8 runs the average weekly number of notifications sent to each
member of the user group for assessment was 59.
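The merging step can be pictured as a simple union that keeps the boolean output in full and truncates each ranked output at the chosen cutoff. The following minimal Python sketch assumes each strategy's output is a list of document numbers; all names are illustrative and not part of the original system:

    # A minimal sketch of the merging step; identifiers are illustrative.

    def merge_outputs(boolean_hits, ranked_outputs, cutoff=25):
        """Combine the full boolean output with the top `cutoff` items
        from each ranked-output strategy, dropping duplicates while
        preserving first-seen order."""
        merged = []
        seen = set()
        # The full output of the (optimum) boolean strategy is always taken.
        for doc_id in boolean_hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
        # Each ranked strategy contributes only its top `cutoff` items.
        for ranking in ranked_outputs:
            for doc_id in ranking[:cutoff]:
                if doc_id not in seen:
                    seen.add(doc_id)
                    merged.append(doc_id)
        return merged

Raising `cutoff` from 25 to 30 or 40 reproduces the adjustment described above for the more productive subject areas.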
Figure 14.2 illustrates in broad outline the operation to the point where the
'single set of notifications without duplicates' has been produced for
despatching to the user for relevance assessments. The actual format of the
notifications (6 in x 4 in cards) followed that used in the commercial INSPEC
SDI service. They included the main bibliographic information (title, author,
affiliation, source reference) plus all the free indexing terms and the main-
entry classification codes. The user also received a summary card of the hit
document numbers on which he indicated the relevance of each document
notified.
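The content of a notification card, and the summary card returned by the user, can be pictured as a simple record. The field names in this Python sketch are assumptions made for illustration, not INSPEC's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class Notification:
        doc_number: int        # hit document number
        title: str
        authors: str
        affiliation: str
        source_reference: str
        free_index_terms: list = field(default_factory=list)
        classification_codes: list = field(default_factory=list)  # main-entry codes

    def summary_card(notifications):
        """Build the summary card: hit document numbers awaiting a
        relevance code from the user (None until assessed)."""
        return {n.doc_number: None for n in notifications}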
In making his relevance assessments the user was asked to apply a three-
category relevance code [1] as follows:
1-highly relevant documents;
2-partly relevant documents;
X-non-relevant documents.
To avoid misleading value judgements, the user was also requested to base
his assessment purely on the subject matter and to ignore such things as the
language of the original document, the quality of the journal in which it
appeared, etc.
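Recording the assessments against the summary card might then look like the following sketch; the code table mirrors the three categories above, and everything else is illustrative:

    # The three-category relevance code applied by the user.
    RELEVANCE_CODES = {
        '1': 'highly relevant',
        '2': 'partly relevant',
        'X': 'non-relevant',
    }

    def record_assessment(summary, doc_number, code):
        """Store the user's relevance code for one notified document."""
        if code not in RELEVANCE_CODES:
            raise ValueError('unknown relevance code: %r' % (code,))
        if doc_number not in summary:
            raise KeyError('document %r was not on the summary card' % (doc_number,))
        summary[doc_number] = code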
This three-category code was deliberately chosen for its relative ease of use
by the user. Highly relevant and completely non-relevant items are in general
quite quickly assessed, with relevance category 2 providing a useful 'dump'
for the difficult or doubtful documents, e.g. those which the user is quite
pleased to see but would not have been concerned had they not been retrieved.
Other relevance categories have of course been proposed and used in
document retrieval experiments. For example in evaluating operational
systems it is useful to distinguish relevant documents which the user has
already seen before being notified via the system from those which are new to
him. As a generalization it might be said that too many categories are not
advisable, with 3 or 4 probably being the optimum number.
A more fundamental issue than relevance categories is the whole question
of relevance. Its nebulous nature has been emphasized increasingly over
recent years, even to the extent of raising it to the realm of philosophical
discourse. Nearly ten years ago Cooper [10] emphasized the distinction between