ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Indexing Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
based linguistic analysis procedures which provide an effective
representation of the information content of source documents without
manual intervention at any stage of the process. Information is
conveyed in the natural language by the variety of semantic and
structural constraints implicit in the language. Machine indexing
techniques all depend, in effect, on the automatic recognition of
some set of the information carrying elements of the natural language,
and on the representation of these elements in a formal structure. In
general, the processes of automatic content analysis can be classified
according to whether they are statistically, semantically, or
syntactically based. A discussion of each classification follows.
A. The Statistical Approach
A natural starting point for statistical content analysis
consists in assuming that meaning is principally carried by the words
used in a document.6 Under this assumption a suitable index
transformation consists in mapping a document into an unordered set of
significant content bearing words extracted from it. A variety of
statistical techniques have been proposed and investigated for
determining the most suitable set of content words (keywords) to be
used for the encoding.7,8 Typically, such techniques generate a
frequency count of word types (ignoring most function words) and then
invoke some frequency sensitive selection process to produce the
document index image. Such procedures can, of course, be extended in
theory to the detection and counting of word pairs, triples, etc.9
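
The frequency-based procedure described above can be illustrated with a minimal sketch in Python. It is not the method of any of the cited studies: the stop list FUNCTION_WORDS, the threshold min_count, and the function names are all hypothetical choices made for illustration, and a simple count threshold stands in for whatever frequency sensitive selection process a particular technique would use. The n-gram function counts pairs (or triples, etc.) of content words that are adjacent after function words have been dropped, which is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical stop list of function words; the report does not specify one.
FUNCTION_WORDS = {
    "the", "of", "a", "an", "and", "in", "to", "is", "are",
    "by", "for", "on", "be", "that", "this", "it",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(text):
    """Return the tokens of the text with function words removed, in order."""
    return [w for w in tokenize(text) if w not in FUNCTION_WORDS]


def index_image(text, min_count=2):
    """Map a document to an unordered set of keywords.

    A word is selected if it occurs at least min_count times
    (an assumed, deliberately simple selection rule).
    """
    counts = Counter(content_words(text))
    return {word for word, count in counts.items() if count >= min_count}


def ngram_counts(text, n=2):
    """Count sequences of n content words (pairs for n=2, triples for n=3)."""
    words = content_words(text)
    return Counter(zip(*(words[i:] for i in range(n))))
```

Calling index_image on a document yields its keyword set, and ngram_counts(document, n=3) extends the same counting to word triples. Raising min_count makes the selection stricter and the resulting index image smaller.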