ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Indexing Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
property in some sample set of document index representations, and to
manually determine from the context of each occurrence in the source
text whether the semantic value assumed in the index transformation,
in fact, agrees with that found in the document. The degree to which
actual usage conforms to the associations assumed in the model can
provide confirmation or suggest changes.
One possibility is to incorporate such statistical evidence
directly into the thesaurus transformation by use of a weighting
scheme. Consider the mapping shown in Figure 2.1. The term "channel"
maps into two categories, one of which, category 30, is associated
with magnetic disk storage, while the other, category 61, is associated
with information transmission. On the basis of the statistics of a
collection of documents the a priori probabilities of each of these
usages can be estimated. Assume that the category 30 context occurs
with relative frequency α and that the category 61 usage occurs with
relative frequency 1-α. Then n occurrences of "channel" in a document
will contribute an amount k·n·α to the resultant weight of category 30
and k·n·(1-α) to category 61 (where k is an arbitrary scaling
constant). In any event, assuming that such a
procedure or its equivalent is carried over a sufficiently large
sample of source text to produce statistically significant correlation
of the various associations incorporated into the index transformation,
the index image of a particular document is still at best a good
approximation of what could be produced manually by applying a similar
set of context dependent rules. In other words the noise introduced by
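
The weighting scheme described above for the term "channel" can be illustrated with a minimal sketch. It assumes a thesaurus stored as a mapping from each term to its categories and their estimated relative frequencies. The function name category_weights, the dictionary layout, and the value 0.25 chosen for the category 30 frequency are hypothetical and serve only as an illustration; the report does not give an estimate.

```python
from collections import defaultdict

# Hypothetical estimate of the relative frequency of the category 30 usage.
# The report gives no value; 0.25 is chosen only for illustration.
ALPHA = 0.25

# Hypothetical thesaurus layout: term -> {category number: relative frequency}.
THESAURUS = {"channel": {30: ALPHA, 61: 1 - ALPHA}}


def category_weights(term_counts, thesaurus, k=1.0):
    """Spread each term's occurrences over its categories.

    A term occurring n times adds k * n * f to every category it maps
    into, where f is the estimated relative frequency of that usage.
    """
    weights = defaultdict(float)
    for term, n in term_counts.items():
        for category, freq in thesaurus.get(term, {}).items():
            weights[category] += k * n * freq
    return dict(weights)


# A document in which "channel" occurs four times, with k = 1.
weights = category_weights({"channel": 4}, THESAURUS, k=1.0)
```

Because the two relative frequencies sum to one, the total weight contributed by a term is k·n regardless of how it is divided between the categories.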