MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Related research efforts in more general areas of linguistic data processing suggest
inter-sentence distances as criteria for the selection of words and word groups in auto-
matic indexing and abstracting processes. In natural language text searching, for example,
the work of both Swanson (1960 [587], 1961 [586], 1963 [583]), and of Maron and Ray 1/,
suggests that limitation of searching to a four-sentence span would eliminate a number of
irrelevant responses to search requests specifying the joint occurrence of two or more
words.
Swanson's findings indicated that if two words or phrases contained in the search
request were found in textual proximity within these limits, they were highly likely to bear
the semantic relationship intended by the requester. Applying the four-
sentence proximity criterion, it was found that the amount of irrelevant material retrieved
by the text searching system could be reduced by 60 percent without serious loss of
relevant information. 2/ Black cites the four-sentence proximity criterion and notes
further that it might be used also to retrieve only a paragraph or similar small portion of
the full text, reducing the amount of material to be read by the user, perhaps by as much
as 90 percent. 3/
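The four-sentence proximity criterion described above can be illustrated with a minimal sketch. It is not a reconstruction of Swanson's or Black's systems. The function names, the naive punctuation-based sentence splitter, the substring matching, and the reading of "four-sentence span" as a window of four consecutive sentences are all assumptions made for illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccur_within_span(text: str, term_a: str, term_b: str, span: int = 4) -> bool:
    """Return True if both terms occur within some window of `span` consecutive sentences."""
    sentences = [s.lower() for s in split_sentences(text)]
    # Crude substring matching stands in for real word or phrase matching.
    hits_a = [i for i, s in enumerate(sentences) if term_a.lower() in s]
    hits_b = [i for i, s in enumerate(sentences) if term_b.lower() in s]
    # Two sentence indices fit in one window of `span` sentences
    # exactly when they differ by less than `span`.
    return any(abs(i - j) < span for i in hits_a for j in hits_b)


doc = (
    "Indexing can be automated. Some methods count words. "
    "Others use position. Machines do this quickly. "
    "Cost is a concern. Abstracts are a separate problem."
)
print(cooccur_within_span(doc, "indexing", "machines"))
print(cooccur_within_span(doc, "indexing", "abstracts"))
```

A search request specifying the joint occurrence of two words would accept a document only when this check succeeds, so documents in which the words appear far apart are rejected. The same sentence indices could also be used to return only the matching window rather than the full text.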
Artandi, in her book-indexing studies, suggested as a topic for further investigation
the possibility that proximity of index term candidates as derived from the same section
of the text could serve to improve the quality of the indexing. Since her computer program
checks for duplicate potential entries occurring on the same page, this feature could be
used for further analysis, on the assumption that the number of occurrences of the same
entry for the same page is an indication of the importance of the discussion of the subject
on that page. 4/
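The assumption mentioned above, that repeated occurrences of the same entry on the same page indicate the importance of the discussion on that page, can be sketched as a simple count. The details of Artandi's program are not given here, so the (term, page) input format and the function names below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Candidate = Tuple[str, int]  # (index term candidate, page number); a hypothetical format


def entries_by_page_count(candidates: List[Candidate]) -> Dict[Candidate, int]:
    # Count how often each (term, page) pair occurs, ignoring letter case.
    return dict(Counter((term.lower(), page) for term, page in candidates))


def rank_pages_for_entry(candidates: List[Candidate], entry: str) -> List[Tuple[int, int]]:
    # Pages for one entry, most occurrences first; ties are broken by page number.
    counts = Counter(page for term, page in candidates if term.lower() == entry.lower())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


candidates = [
    ("catalysis", 12), ("Catalysis", 12), ("catalysis", 40),
    ("enzymes", 12), ("catalysis", 12),
]
print(rank_pages_for_entry(candidates, "catalysis"))
```

Instead of discarding duplicate potential entries for a page, a program of this kind would keep the count and could use it to mark the page carrying the principal discussion of a subject.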
3.3.5 Uses of Special Clues for Selection
Intra- and inter-sentence distances between words are relatively crude examples of
clues to selection of words and word-pairs which, because of their implied relationships,
may be especially significant for indexing, sentence extraction, or document categoriza-
tion. They can be quite readily detected by machine, but the implication that physical
proximity is a good measure of significant co-occurrence is often false. Other clues
which can be detected equally well, mechanically, are those which have to do with position
and format.
1/ Ray, 1961 [494], p. 92.
2/ Swanson, 1963 [583], p. 9, 1961 [586], pp. 298-299.
3/ See Black, 1963 [64], p. 20 and footnote: "The figure 90 percent is derived from
experience in previous experiments, wherein the amount of relevant material
was scanned and a subjective judgment was formed that the relevant material was
actually about 10 percent of the total verbiage retrieved. That is, about 10 percent
of each document contained the relevant material; 90 percent of the document was
of no relevance but the document as a whole was relevant."
4/ Artandi, 1963 [20], p. 47.