MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
The original Borko approach was based on the principles of factor analysis as these
had been developed for the analysis of multivariate data, especially in the field of
psychology. Borko's first experiments were directed to a corpus consisting of 618
abstracts in the field of psychology, amounting to approximately 50,000 words of total
text and 6,800 different words. These words were sorted by computer program into an
order reflecting their respective frequencies of occurrence. From the approximately 200
words that occurred twenty or more times in this corpus, the investigator himself
selected 90 words to serve as index (or, better, index-clue) terms. A matrix was then
developed for the frequencies of co-occurrence of these words and the documents in which
they appeared. From this, a 90 x 90 correlation matrix was computed as follows:
"To compute the correlation coefficient . . . we used the following formula

$$r_{xy} = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left[N\sum x^{2} - (\sum x)^{2}\right]\left[N\sum y^{2} - (\sum y)^{2}\right]}}$$

Where N is equal to the number of documents (618) and x and y are the terms being
correlated." 1/
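
The quoted formula is the standard product-moment correlation computed from raw sums. A minimal sketch of that calculation follows, assuming each term is represented by a list of its frequencies across the N documents. The function names and the handling of a term with zero variance are illustrative assumptions, not part of Borko's program.

```python
from math import sqrt


def term_correlation(x, y):
    """Correlation between two terms, computed from raw sums.

    x[i] and y[i] are the frequencies of the two terms in document i.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("both terms must cover the same documents")

    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        # Assumption: a term with no variation is treated as uncorrelated.
        return 0.0
    return numerator / denominator


def correlation_matrix(term_doc):
    """term_doc[t] holds the per-document frequencies of term t."""
    k = len(term_doc)
    return [
        [term_correlation(term_doc[i], term_doc[j]) for j in range(k)]
        for i in range(k)
    ]
```

Applied to 90 terms over 618 documents, `correlation_matrix` would yield the 90 x 90 matrix described in the text.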
The term-correlation matrix was then factor analyzed and the first ten eigenvectors
were selected as factors to be rotated and interpreted. Borko emphasizes that:
"The interpretation must be made by the investigator and is based upon his knowledge
of the analytic procedures and the subject matter. There is, therefore, a degree of
subjectivity in the names selected for each factor. These names may be regarded
as hypotheses about the factor meaning." 2/
Following the derivation of these "classification categories" by means of the factor
analysis technique, new items may be assigned to the categories on the basis of words
occurring in their texts (abstracts) in accordance with the following procedural steps:
"1. Each document, in machine readable form, is analyzed by the computer.
A list of the index terms and their frequencies of occurrence in each document
is recorded.
"2. The category or categories containing the index term is assigned a value equal
to the product of the number of occurrences of the word in the abstract and the
normalized factor loading of the word in the category. If more than one index term
appears in a category, the products are summed.
"3. After each index term has been considered, the category having the highest
numerical value is selected." 3/
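
The three quoted steps can be sketched as follows. This is an illustration only: the category names, index terms, and loading values in `EXAMPLE_LOADINGS` are hypothetical, and the abstract is assumed to be already split into lower-case words.

```python
from collections import Counter

# Hypothetical normalized factor loadings: category -> {index term: loading}.
EXAMPLE_LOADINGS = {
    "learning": {"learning": 0.8, "memory": 0.5},
    "testing": {"test": 0.7, "score": 0.6},
}


def assign_category(abstract_words, loadings):
    """Return (best_category, scores) for one abstract.

    Step 1: count occurrences of each word in the abstract.
    Step 2: for each category, sum (occurrences * loading) over its index terms.
    Step 3: select the category with the highest summed value.
    """
    counts = Counter(abstract_words)
    scores = {
        category: sum(counts[term] * loading
                      for term, loading in term_loadings.items())
        for category, term_loadings in loadings.items()
    }
    # Assumption: on a tie, the first category encountered is kept.
    best = max(scores, key=scores.get)
    return best, scores


words = "memory and learning in learning tasks".split()
best, scores = assign_category(words, EXAMPLE_LOADINGS)
print(best, scores)
```

In this toy case "learning" occurs twice and "memory" once, so the "learning" category receives 2 x 0.8 + 1 x 0.5, while "testing" receives nothing and is not selected.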
1/ Borko, 1961 [73], p. 283.
2/ Ibid., pp. 285-286.
3/ Borko and Bernick, 1962 [77], pp. 7-8.