ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Indexing Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
2-13
and linguistic style, and due to the difficulties of extracting
contextual information, any set of properties chosen to encode the
information content of documents or search requests in a given field
must reflect statistical approximations over the usage of the
detected features. Such a statistical basis is clearly evident
in the statistical association indexing model discussed earlier, where
it forms an explicit part of the index representation. In various
other indexing schemes such as manual descriptor indexing, or in
mechanized thesaurus indexing, the statistical approximations are, in
effect, hidden in the decision rules incorporated in the index
transformation. This necessary statistical basis for document
content encoding is emphasized because of its significance in terms of
the problems of generating, maintaining, and evaluating indexing
schemes.
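As an illustration of how statistical approximations can be hidden in the
decision rules of an index transformation, the following is a minimal
sketch of a mechanized thesaurus mapping. It is not the procedure used in
this report: the stems, the concept category numbers, the name
index_document, and the rule that unlisted stems are dropped are all
assumptions made for the example.

```python
from collections import Counter

# Hypothetical thesaurus: word stem -> concept category number.
# Stems and category numbers are invented for illustration only.
THESAURUS = {
    "retriev": 101,
    "search": 101,   # judged sufficiently associated with "retriev"
    "index": 102,
    "catalog": 102,
    "comput": 103,
}


def index_document(stems):
    """Map a list of word stems to a concept vector (category -> count).

    Stems absent from the thesaurus are dropped. That choice is itself
    one of the decision rules built into the transformation.
    """
    return Counter(THESAURUS[s] for s in stems if s in THESAURUS)


doc = ["search", "retriev", "index", "librar", "search"]
vector = index_document(doc)
print(dict(vector))  # {101: 3, 102: 1}
```

In this sketch the judgment that "search" and "retriev" belong together
never appears in the resulting vector; it is visible only in the table
that defines the mapping.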
Consider as a concrete example the indexing model
specifically assumed in this report. The semantic associations
incorporated in the thesaurus mapping from word stems into thesaurus
or concept categories can be established on an ad hoc basis, reflecting
individual or collective value judgments. It is possible, however,
to subject these value judgments to experimental verification.
Assume, for example, that a given set of natural language terms
(words, phrases, etc.) is mapped into a single attribute of the index
space, i.e. all the elements of the set have been judged to be
sufficiently associated so as to be treated as a unit in the index
language. It is [OCRerr] the occurrence 0£ this