MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
appendix
Mary Elizabeth Stevens
National Bureau of Standards
Williams found an average correct classification of 62 percent for 474 test items
automatically assigned to one of four solid state categories 28/. In other tests, 2,754
solid state abstracts were classified into three primary and three secondary categories,
using a computer program capable of handling up to 50 clue words, 10 subject categories,
and any number of documents. Performance effectiveness ranged from 62 to 88 percent
correct by comparison with the original classifications at the more generic level and from
67 to 92 percent correct at the more specific level.
Further progress in the application of statistical association, clumping and syntactic
analysis techniques has also been reported. Statistical association techniques are
concerned with correlations and coefficients of similarity assumed to exist between items
or objects sharing common properties. In documentary item applications, document-
document similarities are calculated for sharings of the same index terms or for common
patterns of citing the same references, of being cited by the same other documents, and
the like. Word-association techniques include the development of absolute or relative fre-
quencies of co-occurrence in a given set of documents, such as those representative of a
specific subject matter field. Various normalizing procedures can be used to remove
effects of tendencies for certain words to occur frequently in general. Spiegel and asso-
ciates 38/ at Mitre Corporation have explored means of normalization to eliminate effects
of length of text strings, relative positions of words in a string, and vocabulary size.
Ernst 39/ reports that at Arthur D. Little: "We are ... seeking to provide a working
retrieval system which will incorporate associative features. The objective will be to
make use of automatically computed index term associations as a basis for detecting and
presenting an appropriate list of near-synonyms for the concepts desired by a user,
essentially the automatic generation of a limited thesaurus in response to individual user
requests." Switzer's model 40/ uses co-occurrence statistics of index terms consisting of
words from title or text, authors' names, and words and author names from cited titles.
Significant probabilities for such co-occurrences are then derived.
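The co-occurrence and document-document similarity measures described above can be illustrated with a minimal Python sketch. It is not the procedure of any of the cited investigators: the function names are hypothetical, the normalization (observed co-occurrences divided by the number expected if the two terms were independent) is only one of the "various normalizing procedures" that could be chosen, and the similarity coefficient shown is the Jaccard coefficient, assumed here for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_association(docs):
    """Score each co-occurring term pair as observed / expected co-occurrences.

    docs: list of term lists; each document is treated as a set of terms.
    The expected count assumes the two terms occur independently, which
    discounts pairs that co-occur only because both terms are frequent
    in general (an assumed normalization, chosen for illustration).
    """
    n = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for terms in docs:
        unique = sorted(set(terms))
        doc_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    scores = {}
    for (a, b), observed in pair_freq.items():
        expected = doc_freq[a] * doc_freq[b] / n
        scores[(a, b)] = observed / expected
    return scores


def document_similarity(terms_a, terms_b):
    """Jaccard coefficient: shared index terms / all index terms of the pair."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical index terms for four documents.
docs = [
    ["laser", "crystal"],
    ["laser", "crystal"],
    ["laser", "diode"],
    ["memory", "diode"],
]
scores = cooccurrence_association(docs)
```

A score above 1 marks a pair that co-occurs more often than chance would suggest. The same similarity function could be applied to sets of cited references instead of index terms to compare documents by citation pattern.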
The "clumping" techniques proposed at the Cambridge Language Research Unit (CLRU) group
objects or items according to co-occurrence data for their properties or characteristics.
Further investigations into the development of the
basic CLRU approach have been conducted at the Linguistic Research Center at the University
of Texas, by Dale and others 30, 41/. In this work, simulation of associative document
retrieval by computer gave results for 260 computer abstracts, using the same 90
clue words as previously used by Borko: "The recall ratios in the test requests were high
(i.e., very few relevant documents were not retrieved); relevance ratios were characteristically
smaller (of the order of 10 percent). However, since the output lists are ordered,
it is interesting to note that the relevance ratios are significantly much higher in the upper
portions of the output lists (roughly between 25 percent and 50 percent in the upper fourth
of the output lists), and that recall ratios are still of the order of 50-70 percent."
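The recall and relevance ratios quoted above, and the effect of reading only the upper portion of an ordered output list, can be made concrete with a short Python sketch. The function name, the document identifiers and the relevance judgments are hypothetical; the definitions assumed are the conventional ones (recall ratio = relevant items retrieved / all relevant items; relevance ratio = relevant items retrieved / all items retrieved).

```python
def recall_and_relevance(ranked_ids, relevant_ids, fraction=1.0):
    """Return (recall ratio, relevance ratio) for the top part of a ranked list.

    ranked_ids:   document identifiers in output order, best first
    relevant_ids: identifiers judged relevant to the request
    fraction:     portion of the output list examined (0.25 = upper fourth)
    """
    relevant = set(relevant_ids)
    # Examine at least one item whenever the list is non-empty.
    cutoff = max(1, int(len(ranked_ids) * fraction)) if ranked_ids else 0
    retrieved = ranked_ids[:cutoff]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    relevance = hits / len(retrieved) if retrieved else 0.0
    return recall, relevance


# Hypothetical ranked output for one request.
ranked = ["d3", "d7", "d1", "d9", "d2", "d5", "d8", "d4"]
relevant = {"d3", "d1", "d4"}

full_list = recall_and_relevance(ranked, relevant)           # whole list
upper_fourth = recall_and_relevance(ranked, relevant, 0.25)  # top two items
```

In this toy case, cutting the list to its upper fourth raises the relevance ratio while lowering the recall ratio, which is the same trade-off the quoted passage reports for ordered output lists.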
In 1964 a report of the Astropower Laboratory 42/ outlined a "semantic space
screening model" based on the assumptions that keywords or phrases have quantifiable
'values', that by itemizing the keywords in a document sufficient information is obtained
for its classification, and that by adding the values for the keywords in a document the
pertinence of that document to a particular subject field can be determined. A training
sample consisted of 120 abstracts drawn from six subfields of electrical engineering.
With four different classification formulas, successful classification of source items
ranged from 49 to 96.3 percent. Results with test items ranged from 32.9 to 69.0 percent
accuracy.
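The additive assumption of such a model (keyword values summed to give the pertinence of a document to each subject field) can be sketched in Python as follows. This is an illustration only and does not reproduce any of the four classification formulas of the report: the function name, the subject fields, the keywords and their values are all hypothetical, and each listed keyword is assumed to be counted once.

```python
def classify_by_keyword_values(keywords, value_table):
    """Assign a document to the subject field with the largest summed value.

    keywords:    keywords itemized for one document
    value_table: {subject_field: {keyword: value}}; keywords missing from a
                 field's table contribute nothing to that field's score
    Ties go to the field listed first in value_table.
    """
    scores = {
        field: sum(values.get(word, 0.0) for word in keywords)
        for field, values in value_table.items()
    }
    best_field = max(scores, key=scores.get)
    return best_field, scores


# Hypothetical keyword values for two subfields.
value_table = {
    "circuits": {"amplifier": 2.0, "transistor": 1.5},
    "antennas": {"dipole": 2.5, "gain": 1.0, "amplifier": 0.5},
}
field, scores = classify_by_keyword_values(
    ["amplifier", "gain", "transistor"], value_table
)
```

In practice the keyword values would be estimated from a training sample of already-classified items, which is why accuracy on source items can run well above accuracy on new test items.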