MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
The choice of 90 clue words in Borko's work with abstracts in the field of psycho-
logical literature was apparently dictated by a matrix size which would be convenient
for computer manipulation. 1/ However, it happened to coincide with the number of clue
words used by Maron in his experiments. Advantage was taken of this coincidence to
obtain comparative data on the performance of the two assignment-indexing techniques
as applied to the same material. The 260 computer literature abstracts used by Maron
as source documents were processed to derive a correlation matrix for Maron's 90
manually selected words, which was then factor analyzed. Several sets of factors were
extracted, rotated, and the results studied, with a final selection of 21 categories.
Since these automatically derived categories did not coincide with Maron's original
32, it was necessary to analyze manually the total group of 405 abstracts (260 "source"
and 145 "test" items) and assign them to the new categories, then to study the documents
falling into each factor-analytically derived category to determine which of Maron's 90
clue words were category-indicative, and finally to substitute these words in the Bayesian
equation used by Maron so as to predict which of these classification categories his
probabilistic method should obtain.
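The Bayesian prediction step described above can be illustrated with a minimal sketch. The categories, clue words, priors, and conditional probabilities below are hypothetical stand-ins, not values from Maron's actual experiments; in his work these statistics were estimated from the 260 source abstracts.

```python
import math

# Hypothetical training statistics: P(category) and P(clue word | category),
# standing in for values that would be estimated from the source abstracts.
priors = {"programming": 0.6, "hardware": 0.4}
cond = {
    "programming": {"compiler": 0.30, "subroutine": 0.20, "circuit": 0.01},
    "hardware":    {"compiler": 0.02, "subroutine": 0.05, "circuit": 0.40},
}

def bayes_assign(clue_words_in_doc):
    """Assign the category maximizing P(C) * prod(P(w | C)) over the
    clue words observed in the document (log space to avoid underflow)."""
    best, best_score = None, float("-inf")
    for cat, prior in priors.items():
        score = math.log(prior)
        for w in clue_words_in_doc:
            # Small floor for words never seen with this category.
            score += math.log(cond[cat].get(w, 1e-6))
        if score > best_score:
            best, best_score = cat, score
    return best

print(bayes_assign(["compiler", "subroutine"]))  # programming
print(bayes_assign(["circuit"]))                 # hardware
```

A document is thus assigned to whichever category makes its observed clue words most probable, weighted by the category's prior frequency.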
The same two sets of 260 "source" and 145 "new" abstracts used by Maron were then
submitted to the computer assignment program which compares the clue words of a new
item with the numeric values of the predictor words for each factor category, then com-
putes the score for each item in all categories, and assigns the category with the highest
score to the item. For the source items, Borko and Bernick's results showed 63.4
percent correctly classified, by comparison with the 84.6 percent correctness score
originally obtained for them in Maron's experiments. For the new items, the factor
analysis method scored 48.9 percent correct assignment by comparison with Maron's
original 51.8 percent. 2/ The later investigators therefore concede that the performance
of Maron's technique was somewhat superior for the same items using the clue words
originally selected by Maron.
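The factor-category assignment program described above reduces to a linear scoring rule: each clue word carries a numeric predictor value in each category, an item's score in a category is the sum of those values for the words it contains, and the item is assigned the highest-scoring category. The category names and weights in this sketch are hypothetical, not Borko and Bernick's actual factor loadings.

```python
# Hypothetical predictor-word values for each factor-analytically derived
# category (illustrative stand-ins for the loadings used by the program).
loadings = {
    "automata_theory":  {"automaton": 0.8, "logic": 0.5, "storage": 0.1},
    "computer_storage": {"storage": 0.9, "drum": 0.6, "logic": 0.05},
}

def factor_assign(clue_words_in_doc):
    """Score each category as the sum of predictor-word values for the clue
    words present, then assign the category with the highest score."""
    scores = {
        cat: sum(weights.get(w, 0.0) for w in clue_words_in_doc)
        for cat, weights in loadings.items()
    }
    return max(scores, key=scores.get)

print(factor_assign(["automaton", "logic"]))  # automata_theory
print(factor_assign(["storage", "drum"]))     # computer_storage
```

Unlike the Bayesian rule, this score is a plain additive weight sum with no probabilistic normalization, which is one reason the two methods can rank categories differently for the same item.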
Further experimentation was then carried out (Borko and Bernick, 1963 [78]) using
word-frequency data to select a new set of 90 clue words; a classification scheme
of 21 categories was again automatically derived. The 405 abstracts were again
manually classified to these machine-derived categories by five subject-matter
specialists and the two investigators. Comparative data were then obtained for both the
Maron assignment formula and the modified classification system assignments in terms
of agreement with the manual assignments.
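Frequency-based selection of clue words, as used in this second experiment, can be sketched as picking the most frequent non-function words across the abstract collection. The toy corpus and stop list below are illustrative assumptions, not Borko and Bernick's data.

```python
from collections import Counter

# Toy corpus standing in for the abstract collection; the stop list of
# function words is illustrative only.
abstracts = [
    "the compiler translates the program into machine code",
    "the program calls a subroutine in the compiler",
    "magnetic drum storage holds the program",
]
stop_words = {"the", "a", "into", "in", "holds"}

def select_clue_words(docs, k):
    """Return the k most frequent non-stop words in the collection,
    a frequency-based clue-word selection."""
    counts = Counter(
        w for doc in docs for w in doc.split() if w not in stop_words
    )
    return [w for w, _ in counts.most_common(k)]

print(select_clue_words(abstracts, 3))  # 'program' ranks first
```

With the real collection, k would be 90, matching the matrix dimensions used throughout these experiments.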
For the source items, the percentage of machine assignments agreeing with those
made by people was 62.7 when the Bayesian probability formula used by Maron was
applied and 61.2 for the factor analysis score system. For the new items, the
corresponding correct percentages were 57.9 and 55.9. Additional data compared the
effects of using the original Maron words and the frequency-based word set (Borko's
words) for the same probability formula assignment method. While there was an overlap
of approximately 50 percent between Maron's words and Borko's words, the findings
indicated that:
1/ Now increased to 150 x 150.
2/ Borko and Bernick, 1962 [72], pp. 9-10.