ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
cases where all word stems included in the complete document abstract are
matched (full null), and where all word stems are used, but stems included
in document titles are weighted twice as heavily as other word stems (null
title 2). As can be seen there is not much to choose between these two
methods, although the increased title weights seem to perform slightly
better for high recall points. It should be noted that both of the
complete word matching procedures produce very high precision when the recall
is low. This reflects the fact that the documents which exhibit the highest
similarity with the search requests, and which therefore are retrieved early
in a given search operation - assuming that documents are retrieved in
decreasing order of similarity with the search requests - may be expected
to be almost all relevant to the given request. Or, differently expressed,
a word matching procedure will be useful if the requestor desires to see
only a few documents, and does not insist on obtaining everything that is
relevant within a given collection. The more sophisticated thesaurus
procedures may then be expected to be useful mainly for the purpose of
raising the precision for high recall values, that is, to retrieve documents
which cannot be immediately obtained by a word matching process.
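The two complete word matching variants, and the ranked retrieval on which the precision figures depend, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the report: the cosine correlation is used as the similarity measure, title stems and abstract stems are supplied as separate lists, and the function names (stem_vector, cosine, rank, precision_recall) and the toy collection are hypothetical. Setting title_weight to 1 corresponds to "full null", and setting it to 2 corresponds to "null title 2".

```python
from collections import Counter
from math import sqrt


def stem_vector(title_stems, body_stems, title_weight=1):
    """Map each stem to a weight; each title occurrence counts title_weight."""
    vec = Counter(body_stems)
    for stem in title_stems:
        vec[stem] += title_weight
    return vec


def cosine(a, b):
    """Cosine correlation of two stem vectors (assumed similarity measure)."""
    dot = sum(w * b.get(s, 0) for s, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document ids in decreasing order of similarity with the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


def precision_recall(ranking, relevant):
    """Return (recall, precision) after each retrieved document."""
    points = []
    hits = 0
    for cutoff, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / cutoff))
    return points


# Hypothetical three-document collection, stems already extracted.
docs = {
    "d1": stem_vector(["retriev"], ["retriev", "document", "index"], title_weight=2),
    "d2": stem_vector(["comput"], ["retriev", "storag"], title_weight=2),
    "d3": stem_vector(["librari"], ["catalog", "book"], title_weight=2),
}
query = Counter(["retriev", "document"])

ranking = rank(query, docs)
for recall, precision in precision_recall(ranking, relevant={"d1"}):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the only relevant document is retrieved first, so precision is highest at the first cutoff and falls as further documents are retrieved, which is the behaviour described above for low recall.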
Fig. 10 shows that the word matching procedure which assigns weights
to the stems in proportion to their frequency within a given document
(full null) is much more effective than the equivalent matching process
in which weights are disregarded (null logvec). The logical vector
process is one where each word stem is assigned the same weight, namely
1, and no distinction is made between more and less important stems.
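The difference between frequency weights and the logical vector can be sketched as follows. This is an illustration only: the unnormalized inner product used for matching is an assumption made for brevity, and the function names and the two sample documents are hypothetical.

```python
from collections import Counter


def frequency_vector(stems):
    """Weight each stem by its number of occurrences in the document."""
    return dict(Counter(stems))


def logical_vector(stems):
    """Assign the same weight, 1, to every stem that occurs at all."""
    return {stem: 1 for stem in set(stems)}


def inner_product(a, b):
    """Unnormalized match score between two stem vectors (assumed measure)."""
    return sum(w * b.get(s, 0) for s, w in a.items())


# Hypothetical stemmed documents and a one-stem request.
doc_a = ["retriev", "retriev", "retriev", "system"]
doc_b = ["retriev", "system", "hardwar", "system"]
request = logical_vector(["retriev"])

for name, build in (("frequency", frequency_vector), ("logical", logical_vector)):
    score_a = inner_product(build(doc_a), request)
    score_b = inner_product(build(doc_b), request)
    print(name, score_a, score_b)
```

With frequency weights the document that uses the request stem three times scores above the one that uses it once; with logical vectors the two documents tie, since the distinction between more and less important stems is lost.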
To summarize then, the word stem matching procedure performs best
when all word stems are used from full document abstracts, or full documents,