MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Automatic Assignment Indexing Techniques chapter Mary Elizabeth Stevens National Bureau of Standards

The original Borko approach was based on the principles of factor analysis as these had been developed for the analysis of multivariate data, especially in the field of psychology. Borko's first experiments were directed to a corpus consisting of 618 abstracts in the field of psychology, amounting to approximately 50,000 words of total text and 6,800 different words. These words were sorted by computer program into an order reflecting their respective frequencies of occurrence. From the approximately 200 words that occurred twenty or more times in this corpus, the investigator himself selected 90 words to serve as index (or, better, index-clue) terms. A matrix was then developed for the frequencies of co-occurrence of these words and the documents in which they appeared. From this, a 90 x 90 correlation matrix was computed as follows:

"To compute the correlation coefficient . . . we used the following formula

$$ r_{xy} = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left[ N \sum x^2 - (\sum x)^2 \right] \left[ N \sum y^2 - (\sum y)^2 \right]}} $$

Where N is equal to the number of documents (618) and x and y are the terms being correlated." 1/

The term-correlation matrix was then factor analyzed and the first ten eigenvectors were selected as factors to be rotated and interpreted. Borko emphasizes that: "The interpretation must be made by the investigator and is based upon his knowledge of the analytic procedures and the subject matter. There is, therefore, a degree of subjectivity in the names selected for each factor. These names may be regarded as hypotheses about the factor meaning."
2/ Following the derivation of these "classification categories" by means of the factor analysis technique, new items may be assigned to the categories on the basis of words occurring in their texts (abstracts) in accordance with the following procedural steps:

"1. Each document, in machine readable form, is analyzed by the computer. A list of the index terms and their frequencies of occurrence in each document is recorded.

"2. The category or categories containing the index term is assigned a value equal to the product of the number of occurrences of the word in the abstract and the normalized factor loading of the word in the category. If more than one index term appears in a category, the products are summed.

"3. After each index term has been considered, the category having the highest numerical value is selected." 3/

1/ Borko, 1961 [73], p. 283.
2/ Ibid., pp. 285-286.
3/ Borko and Bernick, 1962 [77], pp. 7-8.
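The two computational steps described in this section, the term-to-term correlation coefficient and the three-step category assignment, can be sketched as follows. This is a minimal illustration and not Borko's program. The function names, the simple alphabetic tokenizer, the handling of a zero denominator, the tie-breaking rule, and the loading values in the usage note are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def correlation(x, y):
    """Correlation between two terms, using the raw-score formula quoted above.

    x and y hold the frequency of each term in every document, so
    len(x) == len(y) == N (the number of documents).
    """
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    denom = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denom == 0:
        # Assumption: a term with constant frequency is treated as uncorrelated.
        return 0.0
    return (n * sum_xy - sum_x * sum_y) / denom


def assign_category(abstract, loadings):
    """Assign an abstract to the highest-scoring category.

    loadings maps category -> {index term: normalized factor loading}.
    This layout is hypothetical; the source does not specify a data format.
    """
    # Step 1: record the frequency of each word in the document.
    counts = Counter(re.findall(r"[a-z]+", abstract.lower()))
    # Step 2: for each category, sum (occurrences x loading) over its index terms.
    scores = {
        category: sum(counts[term] * weight for term, weight in terms.items())
        for category, terms in loadings.items()
    }
    # Step 3: select the category with the highest value.
    # Assumption: ties go to the category listed first.
    best = max(scores, key=scores.get)
    return best, scores
```

For example, with invented loadings `{"learning": {"learning": 0.6, "memory": 0.4}, "perception": {"visual": 0.7, "perception": 0.3}}`, the abstract "Visual perception and visual memory" has "visual" twice and "perception" and "memory" once each, so "perception" outscores "learning" and is selected.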