Scientific Report No. ISR-10. Information Storage and Retrieval. The Indexing Function (chapter). Joseph John Rocchio, Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

property in some sample set of document index representations, and to manually determine from the context of each occurrence in the source text whether the semantic value assumed in the index transformation, in fact, agrees with that found in the document. The degree to which actual usage conforms to the associations assumed in the model can provide confirmation or suggest changes. One possibility is to incorporate such statistical evidence directly into the thesaurus transformation by use of a weighting scheme. Consider the mapping shown in Figure 2.1. The term "channel" maps into two categories, one of which, category 30, is associated with magnetic disk storage, while the other, category 61, is associated with information transmission. On the basis of the statistics of a collection of documents, the a priori probabilities of each of these usages can be estimated. Assume that the category 30 context occurs with relative frequency α and that the category 61 usage occurs with relative frequency 1 − α. Then n occurrences of "channel" in a document will contribute an amount k·n·α to the resultant weight of category 30 and k·n·(1 − α) to category 61 (where k is an arbitrary scaling constant).
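The weighting scheme just described can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the dictionary layout of the thesaurus, the sample of judged occurrences, and the resulting value α = 0.25 are hypothetical and do not come from the report. The sketch first estimates the relative frequency of each usage from a manually judged sample, then distributes the weight of n occurrences of a term across its categories as k·n·α and k·n·(1 − α).

```python
from collections import Counter, defaultdict


def estimate_usage_frequencies(judged_categories):
    """Estimate the relative frequency of each category usage of one term.

    `judged_categories` is a list of category numbers, one per occurrence of
    the term in a sample, as decided manually from context (hypothetical input).
    """
    counts = Counter(judged_categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def category_weights(tokens, thesaurus, k=1.0):
    """Map a document's tokens to category weights.

    `thesaurus` maps a term to {category: relative frequency of that usage}.
    Each term occurring n times adds k * n * frequency to each of its categories.
    Terms absent from the thesaurus contribute nothing.
    """
    weights = defaultdict(float)
    for term, n in Counter(tokens).items():
        for category, frequency in thesaurus.get(term, {}).items():
            weights[category] += k * n * frequency
    return dict(weights)


# Invented sample: of four judged occurrences of "channel", one was used in the
# magnetic disk storage sense (category 30) and three in the information
# transmission sense (category 61), so alpha is estimated as 0.25.
thesaurus = {"channel": estimate_usage_frequencies([30, 61, 61, 61])}

# Invented document with n = 4 occurrences of "channel"; k = 2.0 is arbitrary.
doc = ["channel", "disk", "channel", "channel", "channel"]
weights = category_weights(doc, thesaurus, k=2.0)
print(weights)
```

With these invented figures, category 30 receives 2.0 · 4 · 0.25 = 2.0 and category 61 receives 2.0 · 4 · 0.75 = 6.0. Because the two frequencies sum to one, the total weight contributed by a term depends only on k and n, not on how its usages are split.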
In any event, assuming that such a procedure or its equivalent is carried out over a sufficiently large sample of source text to produce statistically significant correlation of the various associations incorporated into the index transformation, the index image of a particular document is still at best a good approximation of what could be produced manually by applying a similar set of context-dependent rules. In other words the noise introduced by I