MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Other Potentially Related Research
chapter
Mary Elizabeth Stevens
National Bureau of Standards
or somewhat beyond. For the benefit of other research, it will also have
produced tapes of the true text of a large sample of natural-language
abstracts and a lexicon containing all the words of a corpus of current
scientific literature." 1/
6.7 Example of a Proposed Indexing-System Utilizing Related Research Techniques
In addition to the automatic assignment indexing and automatic classification
techniques for which experimental results have been reported, several other techniques
and programs have been proposed. One is the joint American Bar Association-IBM
research program (Eldridge and Dennis, 1963 [182]), for which discussion has been
deferred because of its proposed use of several of the research techniques covered
previously in this section. The experimental corpus will consist of the full text of
approximately 5,000 legal case reports taken chronologically from the Northeastern
Reporter. Approximately half of this material will be processed to obtain word frequency
counts. The frequencies will then be used to prepare for each different word an estimate
of the skewness of its distribution in the collection. The investigators will then personally
inspect the word list as ordered by skewness to divide it into "non-informing" (Type I
words, or an exclusion list) and "informing" (Type II words, or an inclusion list) at some
appropriate cutting point. Then, for each document, a list will be prepared of its
"informing" (Type II) words, maintaining order within the document. For each pair of
such words, statistical association factors will be computed. Eldridge and Dennis
describe other aspects of their proposed technique, in part, as follows:
"For each document in the body of 2,500 cases, a list will be prepared of its
Type II words, maintaining their original order within the document . . . For each
Type II word an `association factor' will be calculated for every other Type II word
with which it appears in any one document by compiling the probability that Word A
would appear this close to Word B this number of times over the entire file, if
the Type II words were distributed at random. (This amounts to borrowing Stiles'
idea of the association factor, but implementing it with a numerical method which
takes into account nearness of the words within the document as well as the fact
that they both occur in the same document.) Since the factors are probabilities,
they will be numbers between zero and one . . . These numbers will be used to
estimate the distances between words in index-word space.
"The next step is to construct from the information about distances between pairs of
words an index-word space in which every word is at the correct (or approximately
correct) distance from every other word in the system with which it exhibits
association. The result of this operation can be visualized schematically as a sort
of grid in which every word can be placed in its appropriate position by assigning
it a set of coordinates."
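The proximity-based association factor described in the quotation can be illustrated with a minimal sketch. It is not the Eldridge-Dennis procedure, which the source does not specify in detail. It assumes that each document is already reduced to its ordered list of Type II words, that only the smallest gap between two words in a document counts, that the chance model is two positions drawn at random from the document's word slots, and that the per-document probabilities are combined by simple multiplication. The function names closeness_probability and association_factors are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def closeness_probability(n, d):
    """Chance that two distinct positions, drawn at random from n word
    slots, lie within d slots of each other (assumed chance model)."""
    if n < 2:
        return 1.0
    d = min(d, n - 1)
    # Ordered position pairs with gap k number 2 * (n - k), for k = 1..d.
    close_pairs = 2 * n * d - d * (d + 1)
    return close_pairs / (n * (n - 1))


def association_factors(documents):
    """Map each alphabetically ordered word pair to the product of its
    per-document closeness probabilities. Values lie above zero and at
    most one; a smaller value means the pair sits closer together than
    chance would suggest. Multiplying across documents is an assumption."""
    factors = defaultdict(lambda: 1.0)
    for words in documents:
        positions = defaultdict(list)
        for index, word in enumerate(words):
            positions[word].append(index)
        n = len(words)
        for a, b in combinations(sorted(positions), 2):
            # Smallest gap between any occurrence of a and any of b.
            gap = min(abs(i - j) for i in positions[a] for j in positions[b])
            factors[(a, b)] *= closeness_probability(n, gap)
    return dict(factors)


# Two toy "documents", each given as its ordered Type II word list.
docs = [
    ["contract", "breach", "damages", "appeal"],
    ["breach", "contract", "appeal", "damages"],
]
factors = association_factors(docs)
```

In this toy input, "breach" and "contract" are adjacent in both documents, so their factor is smaller than that of "appeal" and "breach", which are two slots apart in both. Under the stated assumptions such factors could serve directly as the estimated distances between words; assigning coordinates from those distances is a separate step that this sketch does not attempt.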
1/ Melton, et al., 1963 [414], pp. 14-15.