Scientific Report No. ISR-10 Information Storage and Retrieval
The Indexing Function
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the statistical approximations necessary in a context independent
framework must necessarily distort the characterization of the
document's content.
This suggests that there are essentially two alternatives to
improving an already statistically optimized index transformation.
One method clearly involves the incorporation of context dependent
recognition procedures into the content detection process. In some
sense, this is approximated by encoding larger segments of the
natural language text, e.g. phrases instead of words, or sentences
instead of phrases. Alternatively, context dependence can be
introduced by multi-level recognition procedures in which the
decision rules are altered by global interpretation of a context
free encoding, thereby producing a second context dependent index
representation.
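
The multi-level procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used in the report: the thesaurus entries, the weights, and all function names are hypothetical, and the second-level decision rule (assign each ambiguous term to whichever of its categories has the largest document-wide total weight) is one simple choice among many.

```python
from collections import defaultdict

# Hypothetical thesaurus: term -> {category: statistically derived weight}.
# The entries and weights are invented for illustration only.
THESAURUS = {
    "channel": {"magnetic disk": 0.5, "information transmission": 0.5},
    "noise": {"information transmission": 1.0},
    "bandwidth": {"information transmission": 1.0},
    "track": {"magnetic disk": 1.0},
}


def context_free_encoding(terms):
    """First level: map each known term to all of its weighted categories."""
    return [(term, THESAURUS[term]) for term in terms if term in THESAURUS]


def category_totals(encoding):
    """Global interpretation: sum the weights per category over the document."""
    totals = defaultdict(float)
    for _, categories in encoding:
        for category, weight in categories.items():
            totals[category] += weight
    return dict(totals)


def context_dependent_encoding(terms):
    """Second level: resolve each term to its best globally supported category."""
    encoding = context_free_encoding(terms)
    totals = category_totals(encoding)
    resolved = []
    for term, categories in encoding:
        # Assumed decision rule: among the term's own candidate categories,
        # keep the one with the largest total weight over the whole document.
        best = max(categories, key=lambda category: totals[category])
        resolved.append((term, best))
    return resolved


document = ["channel", "noise", "bandwidth", "track"]
print(category_totals(context_free_encoding(document)))
print(context_dependent_encoding(document))
```

In this toy document the ambiguous term "channel" is resolved to "information transmission", because that category accumulates more total weight from the other terms than "magnetic disk" does. Unambiguous terms keep their single category.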
Consider again a thesaurus transformation of the type
illustrated in Figure 2.1. Assuming that all ambiguous input terms
(terms which map into more than one thesaurus category) are mapped
with statistically derived weights as described above, one can expect
that the correct context will be reinforced over all the term
encodings characterizing the document, whereas the incorrect ones
will not. The term "channel," mapped as shown in Figure 2.1, is
initially associated with two alternative contexts. After the
entire initial encoding is completed, it should be possible to derive
a total score for context "magnetic disk" versus the context
"information transmission" by comparing the total weights of all