Scientific Report No. ISR-10
Information Storage and Retrieval
Chapter: The Indexing Function
Joseph John Rocchio
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

based linguistic analysis procedures which provide an effective representation of the information content of source documents without manual intervention at any stage of the process.

Information is conveyed in the natural language by the variety of semantic and structural constraints implicit in the language. Machine indexing techniques all depend, in effect, on the automatic recognition of some set of the information carrying elements of the natural language, and on the representation of these elements in a formal structure. In general, the processes of automatic content analysis can be classified according to whether they are statistically, semantically, or syntactically based. A discussion of each classification follows.

A. The Statistical Approach

A natural starting point for statistical content analysis consists in assuming that meaning is principally carried by the words [6] used in a document. Under this assumption a suitable index transformation consists in mapping a document into an unordered set of significant content bearing words extracted from it. A variety of statistical techniques have been proposed and investigated for determining the most suitable set of content words (keywords) to be used for the encoding. [7, 8] Typically, such techniques generate a frequency count of word types (ignoring most function words) and then invoke some frequency sensitive selection process to produce the document index image. Such procedures can, of course, be extended in theory to the detection and counting of word pairs, triples, etc. [9]
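
The frequency-count procedure described above can be illustrated with a minimal Python sketch. It is not taken from the report: the function-word list, the tokenizer, the simple count threshold `min_count`, and the names `tokenize` and `index_image` are all assumptions chosen for illustration. The threshold stands in for whatever frequency sensitive selection process a given technique would use. Setting `n` to 2 or 3 extends the count to word pairs or triples.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption, not from the report).
FUNCTION_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for", "by", "with"}
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def index_image(text, min_count=2, n=1):
    """Return the unordered set of word n-grams occurring at least min_count times.

    Function words are dropped before n-grams are formed, so for n > 1 a pair
    may join two content words that were separated by a function word.
    The fixed threshold is an assumed stand-in for a frequency sensitive
    selection process.
    """
    words = [w for w in tokenize(text) if w not in FUNCTION_WORDS]
    # Slide a window of length n over the content words.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return {gram for gram, count in counts.items() if count >= min_count}
```

For example, `index_image("information retrieval and information retrieval systems", n=2)` keeps only the pair that occurs twice, while the default `n=1` yields the repeated single words. The result is a set, which matches the unordered character of the index image described in the text.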