IRE
Information Retrieval Experiment
An experiment: search strategy variations in SDI profiles
chapter
Lynn Evans
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
(1) T1: List of search terms
The initial intention was that, excepting perhaps chemical compounds
(e.g. gallium arsenide), the search terms should be strictly singlets. It was
felt that any degree of pre-coordination would positively bias the results against
some strategies, particularly CT (plain co-ordination of terms). However,
even with simple CT there are difficulties, e.g. the free-index phrase `field
effect transistor' may also appear as the abbreviation `FET' and of course
both versions have to be catered for in searching. If the singlets-only rule
is applied in the search profile then matching on the former produces a
co-ordination level of 3 but matching on `FET' is at a level of 1 only. This
anomaly could be avoided by including the term `FET' in the profile
three times but this would be merely simulating weighting techniques; to
do so would invalidate the comparison with strategy TWC (weighted list
of terms). Other similar examples encountered were STEM/scanning
transmission electron microscope, IMPATT/impact avalanche transit
time, and LEED/low energy electron diffraction.
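The anomaly can be illustrated with a minimal sketch. It assumes that the co-ordination level is simply the number of distinct singlet profile terms found in a document; the profile and the two sample titles are hypothetical and are not taken from the experiment.

```python
import re

def tokenize(text):
    """Lower-case a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def coordination_level(profile_terms, text):
    """Count how many distinct singlet profile terms occur in the text."""
    words = set(tokenize(text))
    return sum(1 for term in set(profile_terms) if term in words)

# Hypothetical singlets-only profile covering both forms of the concept.
profile = ["field", "effect", "transistor", "fet"]

# Hypothetical document titles using the full and abbreviated forms.
full_form = "Noise in the field effect transistor"
abbreviated = "Noise in the FET"

level_full = coordination_level(profile, full_form)
level_abbrev = coordination_level(profile, abbreviated)
print(level_full, level_abbrev)
```

Under these assumptions the full phrase matches at level 3 and the abbreviation at level 1, although both titles express the same concept.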
Some 65 per cent of the profiles included one or more non-singlet terms
although, as a percentage of total search terms, non-singlets amounted to
less than 5 per cent.
On the other hand some pre-coordination of terms might positively
favour some search strategies. For example, for the CT strategy, the
concept `digital circuit*' would naturally be treated as two terms `digital'
and `circuit*'. However for a boolean strategy it might be considered
safer to search on the term `digital circuit*' rather than `digital' AND
`circuit*' since experience has shown that the latter usually throws up a
large number of false drops. Of course searching on `digital circuit*' will
fail to match phrases like `digital logic circuit*'. In an operational system
using a boolean strategy the final decision would probably rest entirely on
which performance measure the user was more interested in: recall or
precision.
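The trade-off between the phrase search and the AND search can be sketched as follows. The sketch assumes that `*' denotes right-hand truncation and that a multi-word term must match adjacent words; the matching functions and the three sample titles are hypothetical illustrations, not the software or data used in the experiment.

```python
import re

def tokenize(text):
    """Lower-case a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_matches(pattern, word):
    """Match one word against a term; a trailing '*' means right truncation."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

def has_term(term, text):
    """True if the (possibly multi-word) term occurs as adjacent words."""
    parts = term.lower().split()
    words = tokenize(text)
    for start in range(len(words) - len(parts) + 1):
        if all(word_matches(p, words[start + i]) for i, p in enumerate(parts)):
            return True
    return False

def and_search(terms, text):
    """Boolean AND: every term must occur somewhere in the text."""
    return all(has_term(t, text) for t in terms)

# Hypothetical titles: one exact phrase, one split phrase, one false drop.
docs = {
    "relevant_phrase": "Design of digital circuits for counters",
    "relevant_split": "Testing digital logic circuits",
    "false_drop": "Digital simulation of analogue circuits",
}

phrase_hits = {n for n, t in docs.items() if has_term("digital circuit*", t)}
and_hits = {n for n, t in docs.items() if and_search(["digital", "circuit*"], t)}
print(sorted(phrase_hits), sorted(and_hits))
```

In this toy collection the phrase search retrieves only the exact phrase and so misses `digital logic circuits' (lower recall), whereas the AND search retrieves all three titles, including the false drop (lower precision).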
Another difficulty encountered in preparing the basic list of search terms
for a particular user statement was the problem of what to do with
nebulous terms like `measur*', `propert*', `design*', `observ*', etc. It was
fairly certain that they could do no harm when used in a weighted list of
terms (and might even improve the uniqueness of the ranked output) but
their usefulness in a boolean search was not clear, and they were as likely as not
to be damaging.
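A small sketch may make the contrast concrete. The weights, the two titles, and the decision to AND the nebulous term into the boolean request are all assumptions made for illustration; the experiment's actual weighting scheme is not reproduced here.

```python
import re

def tokenize(text):
    """Lower-case a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(term, words):
    """True if the term (with optional right truncation '*') matches a word."""
    if term.endswith("*"):
        stem = term.rstrip("*")
        return any(w.startswith(stem) for w in words)
    return term in words

def weighted_score(weights, text):
    """Sum the weights of all profile terms that occur in the text."""
    words = tokenize(text)
    return sum(w for term, w in weights.items() if matches(term, words))

# Hypothetical weighted profile; 'measur*' is the nebulous, low-weight term.
weights = {"laser*": 5, "power": 3, "measur*": 1}

# Hypothetical titles.
docs = {
    "a": "Measurement of laser power",
    "b": "Laser power supplies",
}

scores = {name: weighted_score(weights, text) for name, text in docs.items()}

# Boolean request that ANDs every term, nebulous one included (an assumption).
boolean_hits = {
    name for name, text in docs.items()
    if all(matches(term, tokenize(text)) for term in weights)
}
print(scores, sorted(boolean_hits))
```

Here the nebulous term merely separates two titles that would otherwise tie in the ranked output (9 against 8), while in the boolean request it removes title `b' from the output altogether.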
In the interests of retaining the same set of terms in all the search strategy
variations for a particular query, some compromise had to be accepted
occasionally in the final choice of terms used.
(2) T2: Arrange terms into groups (concepts)
In practice it was found that a large degree of latitude was possible in
dividing the terms into concepts. At one extreme, for a user interested in
`high power gas and liquid lasers', it might be argued that there are only
two basic concepts involved, viz. a device (laser) and a characteristic
(power). Alternatively it could be said that there are five separable
concepts, viz. high, power, laser, gas, and liquid.
The general policy pursued was to divide into as many concepts as
possible. In fact the average number of concepts per user statement was
15, with a range from a minimum of 4 to a maximum of 25.