Scientific Report No. ISR-10
Information Storage and Retrieval
Appendix A: The Smart System
Joseph John Rocchio
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

sentence-by-sentence text image is available for additional content analysis. The principal processes available at this stage are phrase identification procedures, which may be based either on an automatic syntactic analysis or on a simple term-term co-occurrence detection scheme.

At the conclusion of the semantic coding process, the sentence-by-sentence text image is compressed into a weighted property vector index image. Property weights are derived by a summation over the encoded text image, so that the weight of a given component of the vector index image of a document represents the frequency of occurrence in the document of the features mapped into that component.

To reflect the multiple mappings incorporated in the thesaurus transformation, each input term is mapped with a constant total weight. Thus a term which is encoded into a single thesaurus category contributes a weight w (w is a scale factor equal to 12 in the current system). If the input term maps into k categories, each category receives a contribution of w/k to its final weight. An occurrence of the term "band" of Figure 2.1 (Chapter 2), for example, would contribute a weight of 12 to category 30, while an occurrence of the term "carrier" would contribute weights of 6 to each of concepts 61 and 316. This technique prevents ambiguous terms (terms which map into several categories) from distorting the concept weights of the final index vector.

While the property vector is the primary index language of the system, a number of alternative component weighting schemes are possible, including the option of ignoring all frequency-derived information (which produces, in effect, a set-represented text image).
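The weighting rule described above can be illustrated with a minimal Python sketch. It is not the Smart system's own code. The function name `concept_vector`, the dictionary layout of the thesaurus, the decision to skip terms missing from the thesaurus, and the use of 1 as the weight in the frequency-free variant are all assumptions made for illustration. Only the scale factor w = 12 and the two thesaurus entries ("band" and "carrier") come from the text.

```python
from collections import defaultdict

W = 12  # scale factor w stated in the text

# Hypothetical thesaurus fragment: only the two entries quoted in the text.
THESAURUS = {
    "band": [30],           # unambiguous: one category
    "carrier": [61, 316],   # ambiguous: two categories
}


def concept_vector(terms, thesaurus=THESAURUS, w=W, binary=False):
    """Build a weighted concept vector from a sequence of input terms.

    Each term occurrence contributes a constant total weight w, split
    evenly (w/k) over the k categories the term maps into. Terms absent
    from the thesaurus are skipped (an assumption, not from the text).
    With binary=True, frequency information is discarded and every
    category that occurs gets weight 1 (a set-like representation).
    """
    weights = defaultdict(float)
    for term in terms:
        categories = thesaurus.get(term)
        if not categories:
            continue
        share = w / len(categories)
        for category in categories:
            weights[category] += share
    if binary:
        return {category: 1 for category in weights}
    return dict(weights)


if __name__ == "__main__":
    print(concept_vector(["band"]))
    print(concept_vector(["carrier"]))
    print(concept_vector(["band", "carrier", "carrier"], binary=True))
```

Under these assumptions, one occurrence of "band" adds 12 to category 30, and one occurrence of "carrier" adds 6 to each of concepts 61 and 316, matching the worked example in the text. Because every occurrence contributes the same total of 12 regardless of how many categories the term maps to, an ambiguous term cannot outweigh an unambiguous one.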