MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Indexes Generated by Machine-Automatic Derivative Indexing chapter Mary Elizabeth Stevens National Bureau of Standards

Related research efforts in more general areas of linguistic data processing suggest inter-sentence distances as criteria for the selection of words and word groups in automatic indexing and abstracting processes. In natural language text searching, for example, the work of both Swanson (1960 [587], 1961 [586], 1963 [583]), and of Maron and Ray 1/, suggests that limitation of searching to a four-sentence span would eliminate a number of irrelevant responses to search requests specifying the joint occurrence of two or more words. Swanson's findings indicated that if two words or phrases contained in the search request were found in textual proximity within these limits, they were highly likely to bear a semantic relationship that is what was intended by the requester. Applying the four-sentence proximity criterion, it was found that the amount of irrelevant material retrieved by the text searching system could be reduced by 60 percent without serious loss of relevant information. 2/

Black cites the four-sentence proximity criterion and notes further that it might be used also to retrieve only a paragraph or similar small portion of the full text, reducing the amount of material to be read by the user, perhaps by as much as 90 percent. 3/

Artandi, in her book-indexing studies, suggested as a topic for further investigation the possibility that proximity of index term candidates as derived from the same section of the text could serve to improve the quality of the indexing.
Since her computer program checks for duplicate potential entries occurring on the same page, this feature could be used for further analysis, on the assumption that the number of occurrences of the same entry for the same page is an indication of the importance of the discussion of the subject on that page. 4/

3.3.5 Uses of Special Clues for Selection

Intra- and inter-sentence distances between words are relatively crude examples of clues to selection of words and word-pairs which, because of their implied relationships, may be especially significant for indexing, sentence extraction, or document categorization. They can be quite readily detected by machine, but the implication that physical proximity is a good measure of significant co-occurrence is often false. Other clues which can be detected equally well, mechanically, are those which have to do with position and format.

1/ Ray, 1961 [494], p. 92.
2/ Swanson, 1963 [583], p. 9, 1961 [586], pp. 298-299.
3/ See Black, 1963 [64], p. 20 and footnote: "The figure 90 percent is derived from experience in previous experiments, wherein the amount of relevant material was scanned and a subjective judgment was formed that the relevant material was actually about 10 percent of the total verbiage retrieved. That is, about 10 percent of each document contained the relevant material; 90 percent of the document was of no relevance but the document as a whole was relevant."
4/ Artandi, 1963 [20], p. 47.
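
As an illustration only, the four-sentence proximity criterion discussed above might be sketched as follows in Python. This is not a reconstruction of the programs used by Swanson, Ray, or Black. The function names, the naive punctuation-based sentence splitter, and the reading of "four-sentence span" as "both terms fall inside some window of four consecutive sentences" are all assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_positions(sentences, term):
    """Return indices of the sentences containing `term` as a whole word or phrase."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, s in enumerate(sentences) if pattern.search(s)]


def cooccur_within_span(text, term_a, term_b, span=4):
    """True if both terms occur inside some window of `span` consecutive sentences.

    Two sentences at indices i and j share such a window when abs(i - j) < span.
    """
    sentences = split_sentences(text)
    pos_a = sentence_positions(sentences, term_a)
    pos_b = sentence_positions(sentences, term_b)
    return any(abs(i - j) < span for i in pos_a for j in pos_b)


if __name__ == "__main__":
    sample = (
        "Indexing is hard. We like cats. Dogs bark. "
        "Birds sing. Fish swim. Retrieval is fun."
    )
    # "Indexing" (sentence 0) and "birds" (sentence 3) share a four-sentence window.
    print(cooccur_within_span(sample, "indexing", "birds"))
    # "Indexing" (sentence 0) and "retrieval" (sentence 5) do not.
    print(cooccur_within_span(sample, "indexing", "retrieval"))
```

A search request specifying the joint occurrence of two terms would accept a document only when such a check succeeds. As the text cautions, physical proximity remains a crude clue, so a test of this kind reduces irrelevant responses without guaranteeing a semantic relationship.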