MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Assignment Indexing Techniques
chapter
Mary Elizabeth Stevens
National Bureau of Standards
In addition to his work on probabilistic indexing with emphasis on relevance
weightings for index tags manually assigned, Maron has actively explored automatic
assignment indexing techniques. The approach is also probabilistic, with emphasis on
the statistics of association between content-indicative clue words and subject headings
manually assigned to sample documents. The experimental corpus consisted of a group
of abstracts in the field of computer technology indexed to 32 subject categories designed
for the purposes of these investigations.
Common words such as articles and prepositions were first excluded. Next, words
occurring fewer than three times were purged, and words such as "data" and "computer"
were also rejected because they occur so frequently in this literature. Approximately
1,000 words remained after these purging operations. After sorting the source documents
to their most appropriate subject categories, statistical frequencies were
obtained for the co-occurrences of the candidate clue-words with the categories and the
resulting listings were manually examined to determine which words peaked in a
particular category. Eventually, 90 such words were selected.
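The purging and selection steps described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually programmed by Maron: the function name select_clue_words and the peak_share threshold are hypothetical, and the threshold merely stands in for the manual examination of the listings that the text describes.

```python
from collections import Counter, defaultdict


def select_clue_words(docs, stop_words, too_common, min_count=3, peak_share=0.6):
    """Return {clue_word: category} for words that peak in one category.

    docs        -- list of (category, list_of_words) pairs (the source documents)
    stop_words  -- common words such as articles and prepositions
    too_common  -- words rejected as too frequent in the literature
    min_count   -- words occurring fewer times than this are purged
    peak_share  -- ASSUMPTION: a numeric stand-in for the manual judgment of
                   whether a word "peaks" in a particular category
    """
    total = Counter()                   # occurrences of each word overall
    by_category = defaultdict(Counter)  # word -> occurrences per category

    for category, words in docs:
        for word in words:
            word = word.lower()
            if word in stop_words or word in too_common:
                continue
            total[word] += 1
            by_category[word][category] += 1

    clue_words = {}
    for word, count in total.items():
        if count < min_count:
            continue  # purge rare words
        category, top = by_category[word].most_common(1)[0]
        if top / count >= peak_share:
            clue_words[word] = category
    return clue_words


docs = [
    ("logic", ["the", "boolean", "boolean", "data"]),
    ("logic", ["boolean", "gate"]),
    ("memory", ["core", "core", "core", "boolean", "data"]),
]
clues = select_clue_words(docs, stop_words={"the"}, too_common={"data"})
print(clues)
```

In this toy corpus "gate" is purged as too rare, while "boolean" and "core" survive because most of their occurrences fall in a single category.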
The occurrence of one or more of the 90 clue-words in the text of new documents was
then used to predict the subject category to which the new item should belong. 1/ Tests
were run with two groups of documents, one consisting of the source items from which
the statistical frequency and word list data had been obtained, and the second group
consisting of 145 genuinely new items. For the latter group, twenty documents contained
no clue words whatever and forty items had only one. For the remaining 85 items having
two or more clue words, the results of the computer assignment program were predictions
of the correct category in 44, or 51.8 percent, of the cases. 2/ Results using the
source documents were significantly better, as expected, with 84.6 percent accuracy of
category prediction for 247 items. Results were also related to the number of clue words
that occurred in the test items, with a prediction accuracy of only 48.7 percent for items
with a single clue word rising to 100 percent probability of correct assignment if six or
more clue words occurred.
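A prediction step of this general kind can be sketched as below. The text states only that the approach is probabilistic and rests on clue-word and category co-occurrence statistics, so the scoring rule shown here (a naive-Bayes-style sum of smoothed log probabilities) is an assumption made for illustration, as are the names predict_category, cooccurrence, category_counts, and smoothing.

```python
import math


def predict_category(words, cooccurrence, category_counts, smoothing=1.0):
    """Predict a subject category from the clue words present in a document.

    words           -- list of words in the new document
    cooccurrence    -- {clue_word: {category: n_source_docs_containing_word}}
    category_counts -- {category: n_source_docs_in_category}
    smoothing       -- ASSUMPTION: additive smoothing so that unseen
                       word/category pairs do not produce log(0)

    Returns None when the document contains no clue words at all.
    """
    present = {w.lower() for w in words if w.lower() in cooccurrence}
    if not present:
        return None

    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(category_counts):
        n_docs = category_counts[category]
        # Prior for the category plus evidence from each clue word present.
        score = math.log(n_docs / total_docs)
        for word in present:
            count = cooccurrence[word].get(category, 0)
            score += math.log((count + smoothing) / (n_docs + 2 * smoothing))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


cooccurrence = {
    "boolean": {"logic": 8, "memory": 1},
    "core": {"memory": 9},
}
category_counts = {"logic": 10, "memory": 10}

print(predict_category(["Boolean", "algebra"], cooccurrence, category_counts))
print(predict_category(["core", "boolean"], cooccurrence, category_counts))
print(predict_category(["algebra"], cooccurrence, category_counts))
```

A document with no clue words yields no prediction, which corresponds to the twenty new items reported above as containing no clue words whatever.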
Trachtenberg (1963 [608]) has also considered a probabilistic approach to automatic
indexing and categorization of documents, similar to that of Maron. He suggests the
investigation of two information theoretic measures with reference to determination of
which of various possible clue words are significantly discriminating with respect to the
different categories. He further suggests experiments using 90 clue words and the
corpus used by both Maron and Borko, but no actual results have as yet been reported.
4.3 Automatic Indexing Investigations of Borko and Bernick
At the System Development Corporation, the work of Borko (1960 [73]), and of
Borko and Bernick (1962 [77], 1963 [78], 1964 [79]) in the area of automatic indexing
has involved both automatic assignment indexing and automatic classification techniques.
They have not only reported actual indexing results but have provided data for the
intercomparison of their techniques with the experiments of Maron for the same source
material.
1/ Note that the word itself is not necessarily used as an index tag or label, as is the
case for derivative indexing using an inclusion list approach. This is an important
distinction.
2/ Maron, 1961 [395], p. 257.