ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
Appendix A: The Smart System
appendix
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
sentence by sentence text image is available for additional content
analysis. The principal processes available at this stage are phrase
identification procedures which may be based on an automatic syntactic
analysis or on a simple term-term co-occurrence detection scheme.
At the conclusion of the semantic coding process, the sentence
by sentence text image is compressed into a weighted property vector
index image. Property weights are derived by a summation over the
encoded text image so that the weight of a given component of the vector
index image of a document is representative of the frequency of
occurrence in the document of the features mapped into that component.
To reflect the multiple mappings incorporated in the thesaurus
transformation, each input term is mapped with a constant total weight.
Thus a term which is encoded into a single thesaurus category contributes
a weight w (w is a scale factor equal to 12 in the current system). If
the input term maps into k categories, each category receives a
contribution of w/k to its final weight. An occurrence of the term "band" of
Figure 2.1 (chapter 2), for example, would contribute a weight of 12 to
category 30, while an occurrence of the term "carrier" would contribute
weights of 6 to concepts 61 and 316. This technique prevents ambiguous
terms (terms which map into several categories) from distorting the
concept weights of the final index vector. While the property vector
is the primary index language of the system, a number of alternative
component weighting schemes are possible, including the option of
ignoring all frequency derived information (which produces, in effect,
a set-represented text image).
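
The weighting rule described above can be illustrated with a minimal
sketch. It is not the SMART implementation: the function name
index_vector, the THESAURUS fragment (which holds only the two entries
quoted in the text), and the binary flag are assumptions introduced for
illustration. Each term occurrence spreads a constant total weight w
over the k categories it maps to, and the contributions are summed over
the document.

```python
from collections import defaultdict
from fractions import Fraction

W = 12  # scale factor w stated in the text

# Hypothetical thesaurus fragment: only the two mappings quoted in the text.
THESAURUS = {
    "band": [30],
    "carrier": [61, 316],
}


def index_vector(terms, thesaurus=THESAURUS, w=W, binary=False):
    """Sum per-occurrence contributions of w/k into category weights.

    terms  : iterable of term occurrences taken from the text image
    binary : assumed option that discards frequency information and
             keeps only which categories occur (set-like representation)
    """
    weights = defaultdict(Fraction)
    for term in terms:
        categories = thesaurus.get(term, [])
        if not categories:
            continue  # term not in the thesaurus: no contribution in this sketch
        share = Fraction(w, len(categories))  # constant total weight w split k ways
        for category in categories:
            weights[category] += share
    if binary:
        return {category: 1 for category in weights}
    return dict(weights)


if __name__ == "__main__":
    doc = ["band", "carrier", "carrier"]
    print(index_vector(doc))
    print(index_vector(doc, binary=True))
```

With this fragment, one occurrence of "band" adds 12 to category 30,
and each occurrence of "carrier" adds 6 to concepts 61 and 316, so an
ambiguous term never contributes more in total than an unambiguous one.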