Information Retrieval Experiment

An experiment: search strategy variations in SDI profiles
Lynn Evans
In: Information Retrieval Experiment, Karen Sparck Jones (ed.), Butterworth & Company

(1) T1: List of search terms

The initial intention was that, excepting perhaps chemical compounds (e.g. gallium arsenide), the search terms should be strictly singlets. It was felt that any degree of pre-coordination would bias against some strategies, particularly CT (plain co-ordination of terms). However, even with simple CT there are difficulties: the free-index phrase `field effect transistor' may also appear as the abbreviation `FET', and of course both versions have to be catered for in searching. If the singlets-only rule is applied in the search profile, then matching on the former produces a co-ordination level of 3 but matching on `FET' is at a level of 1 only. This anomaly could be avoided by including the term `FET' in the profile three times, but this would merely simulate weighting techniques; to do so would invalidate the comparison with strategy TWC (weighted list of terms). Other similar examples encountered were STEM/scanning transmission electron microscope, IMPATT/impact avalanche transit time, and LEED/low energy electron diffraction.

Some 65 per cent of the profiles included one or more non-singlet terms although, as a percentage of total search terms, non-singlets amounted to less than 5 per cent.

On the other hand, some pre-coordination of terms might positively favour some search strategies.
For example, for the CT strategy, the concept `digital circuit*' would naturally be treated as two terms, `digital' and `circuit*'. However, for a boolean strategy it might be considered safer to search on the term `digital circuit*' rather than `digital' AND `circuit*', since experience has shown that the latter usually throws up a large number of false drops. Of course, searching on `digital circuit*' will fail to match phrases like `digital logic circuit*'. In an operational system using a boolean strategy the final decision would probably rest entirely on which performance measure the user was more interested in: recall or precision.

Another difficulty encountered in preparing the basic list of search terms for a particular user statement was the problem of what to do with nebulous terms like `measur*', `propert*', `design*', `observ*', etc. It was fairly certain that they could do no harm when used in a weighted list of terms (and might even improve the uniqueness of the ranked output), but their usefulness in a boolean search is not clear and they are as likely as not to be damaging. In the interests of retaining the same set of terms in all the search strategy variations for a particular query, some compromise had to be accepted occasionally in the final choice of terms used.

(2) T2: Arrange terms into groups (concepts)

In practice it was found that a large degree of latitude was possible in dividing the terms into concepts. At one extreme, for a user interested in `high power gas and liquid lasers', it might be argued that there are only two basic concepts involved, viz. a device (laser) and a characteristic (power). Alternatively it could be said that there are five separable concepts, viz. high, power, laser, gas, and liquid. The general policy pursued was to divide into as many concepts as possible.
In fact the average number of concepts per user statement was 15, ranging from a minimum of 4 to a maximum of 25.
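
The two matching difficulties discussed under T1 can be illustrated with a short sketch. This is not the software used in the experiment: the tokenizer, the function names (`coordination_level`, `phrase_match`, `and_match`) and the sample texts are all hypothetical, and the only assumptions carried over from the text are that a trailing `*` denotes right truncation and that the co-ordination level is the number of profile terms matched.

```python
import re

def tokenize(text):
    """Lower-case a text and split it into word tokens (an assumed, simple scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(term, token):
    """Match one search term against one token; a trailing '*' means right truncation."""
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term

def coordination_level(profile_terms, text):
    """CT-style score: number of profile terms matching at least one token."""
    tokens = tokenize(text)
    return sum(any(term_matches(t, tok) for tok in tokens) for t in profile_terms)

def and_match(terms, text):
    """Boolean AND of singlets: every term must match somewhere in the text."""
    tokens = tokenize(text)
    return all(any(term_matches(t, tok) for tok in tokens) for t in terms)

def phrase_match(phrase_terms, text):
    """Pre-coordinated phrase: the terms must match adjacent tokens, in order."""
    tokens = tokenize(text)
    n = len(phrase_terms)
    return any(
        all(term_matches(t, tok) for t, tok in zip(phrase_terms, tokens[i:i + n]))
        for i in range(len(tokens) - n + 1)
    )

# Singlets-only profile covering both the phrase and its abbreviation.
profile = ["field", "effect", "transistor", "fet"]
print(coordination_level(profile, "Noise in the field effect transistor"))  # 3
print(coordination_level(profile, "Noise in the FET"))                      # 1

# Phrase versus AND for the concept `digital circuit*'.
text = "Testing of digital logic circuits"
print(and_match(["digital", "circuit*"], text))     # True
print(phrase_match(["digital", "circuit*"], text))  # False
```

The first pair of calls reproduces the anomaly: the same concept scores 3 when spelled out and 1 when abbreviated. The second pair shows the trade-off for a boolean strategy: the AND of singlets retrieves `digital logic circuits' (and would equally retrieve texts where the two words are unrelated), whereas the pre-coordinated phrase misses it.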