MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Other Potentially Related Research chapter Mary Elizabeth Stevens National Bureau of Standards

or somewhat beyond. For the benefit of other research, it will also have produced tapes of the true text of a large sample of natural-language abstracts and a lexicon containing all the words of a corpus of current scientific literature." 1/

6.7 Example of a Proposed Indexing-System Utilizing Related Research Techniques

In addition to the automatic assignment indexing and automatic classification techniques for which experimental results have been reported, several other techniques and programs have been proposed. One is the joint American Bar Association-IBM research program (Eldridge and Dennis, 1963 [182]), for which discussion has been deferred because of its proposed use of several of the research techniques covered previously in this section.

The experimental corpus will consist of the full text of approximately 5,000 legal case reports taken chronologically from the Northeastern Reporter. Approximately half of this material will be processed to obtain word frequency counts. The frequencies will then be used to prepare for each different word an estimate of the skewness of its distribution in the collection. The investigators will then personally inspect the word list as ordered by skewness to divide it into "non-informing" (Type I words, or an exclusion list) and "informing" (Type II words, or an inclusion list) at some appropriate cutting point. Then, for each document, a list will be prepared of its "informing" (Type II) words, maintaining order within the document. For each pair of such words, statistical association factors will be computed.
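The counting, ordering, and list-building steps just described can be sketched in Python. This is a minimal illustration, not the investigators' program. The report does not say how skewness is to be estimated, so the sketch assumes the ordinary moment coefficient of skewness of a word's per-document counts. It also assumes that the more skewed words are the informing ones and that the cutting point is a simple numeric threshold, whereas in the proposal the cut is made by the investigators on inspection. The names `tokenize`, `rank_by_skewness`, `type_ii_lists`, and `cutoff` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def skewness(values):
    # Moment coefficient of skewness, m3 / m2**1.5 (assumed measure).
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0  # the word occurs equally often in every document
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def rank_by_skewness(docs):
    # Word frequency counts per document, then one skewness score per word.
    tokenized = [tokenize(d) for d in docs]
    counts = [Counter(tokens) for tokens in tokenized]
    vocab = set().union(*counts)
    scores = {w: skewness([c[w] for c in counts]) for w in vocab}
    # Most skewed first; ties are broken alphabetically.
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked, scores, tokenized


def type_ii_lists(tokenized, scores, cutoff):
    # Words at or above the (hand-chosen) cutoff are treated as Type II.
    informing = {w for w, s in scores.items() if s >= cutoff}
    # Keep the original word order within each document.
    return [[w for w in tokens if w in informing] for tokens in tokenized]
```

In use, `rank_by_skewness` would be run on the sample half of the corpus, the ranked list inspected, and the chosen `cutoff` passed to `type_ii_lists` to obtain the ordered Type II word list for each document.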
Eldridge and Dennis describe other aspects of their proposed technique, in part, as follows:

"For each document in the body of 2,500 cases, a list will be prepared of its Type II words, maintaining their original order within the document . . . For each Type II word an `association factor' will be calculated for every other Type II word with which it appears in any one document by compiling the probability that Word A would appear this close to Word B this number of tries over the entire file, if the Type II words were distributed at random. (This amounts to borrowing Stiles' idea of the association factor, but implementing it with a numerical method which takes into account nearness of the words within the document as well as the fact that they both occur in the same document.) Since the factors are probabilities, they will be numbers between zero and one . . . These numbers will be used to estimate the distances between words in index-word space.

"The next step is to construct from the information about distances between pairs of words an index-word space in which every word is at the correct (or approximately correct) distance from every other word in the system with which it exhibits association. The result of this operation can be visualized schematically as a sort of grid in which every word can be placed in its appropriate position by assigning it a set of coordinates."

1/ Melton, et al., 1963 [414], pp. 14-15.
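The two steps quoted from Eldridge and Dennis, a proximity-sensitive association factor and the placement of words in an index-word space, can be illustrated with a short Python sketch. The quoted passage does not give the numerical method, so everything below is an assumption made for illustration. Two words count as "close" when they fall within `window` positions of each other in a document's Type II list. The chance probability is taken as a binomial tail under random placement of words. The factor itself is used directly as the target distance. Coordinates are fitted by a simple iterative relaxation that nudges each associated pair toward its target distance. All function and parameter names are hypothetical.

```python
from collections import Counter
from math import comb, sin, sqrt


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p), clamped to [0, 1]."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return min(1.0, max(0.0, 1.0 - below))


def association_factors(type_ii_docs, window=3):
    """Map each co-occurring word pair to a chance probability in [0, 1]."""
    freq = Counter(w for doc in type_ii_docs for w in doc)
    total = sum(freq.values())
    close = Counter()  # unordered word pair -> number of close position pairs
    slots = 0          # all close position pairs in the whole file
    for doc in type_ii_docs:
        for i, a in enumerate(doc):
            for b in doc[i + 1 : i + 1 + window]:
                slots += 1
                if a != b:
                    close[tuple(sorted((a, b)))] += 1
    factors = {}
    for (a, b), k in close.items():
        # Chance that one random close slot holds this pair (assumed model).
        p = 2 * (freq[a] / total) * (freq[b] / total)
        # Small values mean the observed closeness is unlikely by chance.
        factors[(a, b)] = binomial_tail(slots, k, p)
    return factors


def place_words(distances, dims=2, sweeps=5000, rate=0.05):
    """Fit coordinates so associated words sit near their target distances."""
    words = sorted({w for pair in distances for w in pair})
    # Deterministic starting positions (an arbitrary choice).
    coords = {
        w: [sin((i + 1) * (d + 1)) for d in range(dims)]
        for i, w in enumerate(words)
    }
    for _ in range(sweeps):
        for (a, b), target in distances.items():
            diff = [x - y for x, y in zip(coords[a], coords[b])]
            dist = sqrt(sum(c * c for c in diff))
            if dist == 0.0:
                continue  # coincident points give no direction to move in
            step = rate * (dist - target) / dist
            for d in range(dims):
                coords[a][d] -= step * diff[d]  # pull or push a along a-b
                coords[b][d] += step * diff[d]  # and b the opposite way
    return coords
```

Calling `place_words(association_factors(type_ii_docs))` yields one coordinate list per word. Only pairs that actually exhibit association constrain the layout, which matches the quoted description. Factors computed from real text need not satisfy the triangle inequality, so the fitted distances are in general only approximately correct.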