Scientific Report No. IRS-13 Information Storage and Retrieval
Search Matching Functions
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
SMART Test Results - Weighting Scheme
A) Description of Weighting Scheme
Weighted document and request vectors, rather than the binary ones
presented up to now in considering the overlap and cosine, may be constructed
by assigning to each content identifier a 'weight' that reflects the
importance or usefulness of that identifier. Since the assignment of weights
is ideally done by automatic means, the weighting scheme in use with SMART
relies initially on frequency information. When suffix 's' and stem
dictionaries are used, concepts are weighted entirely by frequency of occurrence
of the concepts in the documents (or requests): thus a concept that occurs
three times in a document will receive three times the weight of a concept
that appears only once.
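
The frequency rule can be illustrated with a minimal Python sketch. The concept names and the function name `frequency_weights` are hypothetical and are not taken from the SMART programs; the sketch assumes that a text has already been reduced to a list of concept identifiers by a dictionary lookup.

```python
from collections import Counter

def frequency_weights(concepts):
    """Weight each concept by the number of times it occurs in one text."""
    return dict(Counter(concepts))

# Hypothetical concept identifiers produced by a stem dictionary lookup.
doc_concepts = ["retriev", "index", "retriev", "document", "retriev"]
weights = frequency_weights(doc_concepts)
```

In this sketch "retriev" occurs three times and "index" once, so "retriev" receives three times the weight of "index".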
With a thesaurus dictionary in use, or any dictionary that permits
a word to appear in more than one concept group, an additional adjustment of
the weight reflects word ambiguity. Thus, if a word appears in more than
one concept group it is assumed to be ambiguous, and the weight assigned
to the concept number representing the ambiguous word is decreased according
to the number of concept groups in which the word appears. Many other
modifications to a weighting procedure of this type can be suggested; for example,
where abstracts and titles are used the title words may be given higher weights
than the abstract words.
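
One possible form of the ambiguity adjustment is sketched below in Python. The text states only that the weight is decreased according to the number of concept groups, so the rule used here (each occurrence of a word contributes 1/k to each of its k concept groups) is an assumption, as are the thesaurus entries, the concept numbers, the function name `concept_weights`, and the title factor of 2.

```python
from collections import defaultdict

# Hypothetical thesaurus: word -> concept group numbers containing that word.
THESAURUS = {
    "bank": [101, 205],  # listed in two groups, so treated as ambiguous
    "loan": [101],
    "river": [205],
}

def concept_weights(words, thesaurus, title_words=(), title_boost=2.0):
    """Accumulate concept weights, splitting each word's unit weight
    evenly over the concept groups that contain it (an assumed rule)."""
    weights = defaultdict(float)

    def add(word, unit):
        groups = thesaurus.get(word, [])  # words not in the thesaurus are skipped
        for concept in groups:
            weights[concept] += unit / len(groups)

    for word in words:
        add(word, 1.0)
    for word in title_words:
        add(word, title_boost)  # assumed emphasis factor for title words
    return dict(weights)

abstract_only = concept_weights(["bank", "loan", "bank"], THESAURUS)
with_title = concept_weights(["loan"], THESAURUS, title_words=["river"])
```

Under these assumptions the two occurrences of the ambiguous word "bank" together add 1.0 to each of concepts 101 and 205, while the single unambiguous "loan" adds a full 1.0 to concept 101. In the second call the title word "river" counts double.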
Both the overlap and cosine correlation coefficients may be used
with weighted vectors. For example, if a hypothetical request and document
are weighted as follows:
Concept          a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u
Request Weight   1  2  1  1  1  3  1  2
Document Weight  -  -  -  1  3  1  4  2  1  3  7  1  1  2  3  5  1  1  1  1  1
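
For reference, the two coefficients can be computed on weighted vectors as in the following Python sketch. The formulas are assumptions, since they are not restated in this section: overlap is taken as the sum of the smaller of the two weights on each shared concept, divided by the smaller of the two total weights, and cosine is taken as the usual normalized inner product. The vectors are small hypothetical ones, not the values tabulated above, and both are assumed to be non-empty.

```python
import math

def overlap(q, d):
    """Weighted overlap: sum of min weights over the smaller total weight."""
    shared = sum(min(q[c], d[c]) for c in q.keys() & d.keys())
    return shared / min(sum(q.values()), sum(d.values()))

def cosine(q, d):
    """Cosine correlation: inner product divided by the vector lengths."""
    dot = sum(q[c] * d[c] for c in q.keys() & d.keys())
    q_len = math.sqrt(sum(w * w for w in q.values()))
    d_len = math.sqrt(sum(w * w for w in d.values()))
    return dot / (q_len * d_len)

# Hypothetical weighted vectors; concepts absent from a vector have weight 0.
request = {"a": 1, "b": 2, "c": 1}
document = {"b": 1, "c": 3, "d": 2}

overlap_value = overlap(request, document)
cosine_value = cosine(request, document)
```

For these vectors the shared concepts are b and c, the sum of the smaller weights is 2, and the smaller total weight is 4, so the overlap is 0.5. The cosine is 5 divided by the square root of 84.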