4.   AUTOMATIC ASSIGNMENT INDEXING TECHNIQUES

     Whether indexing by machine is possible depends in part on whether what a machine
can achieve is properly termed "indexing" at all.   If "indexing" is defined as being more
than the mere extraction of words from titles, abstracts, or text, then automatic
derivative indexing, even when augmented by various modifications, normalizations, and
editings, does not provide affirmative evidence.  Under concept-oriented definitions of
indexing, the question becomes whether automatic assignment indexing is possible.
Experimental evidence suggesting that it is will be presented in this section.

     We should note first, however, that just as there are differences of opinion as to
what "indexing" means, so there are similar differences with respect to whether or not
it represents concepts rather than extracted words.   There are also a number of
conflicting definitions of what is meant by "indexing" in contradistinction to
"classifying".  For some, the latter difference is related to the number of labels or
surrogates assigned to a single item to represent its subject contents, ranging from the
assignment of a single subject category in a classification scheme involving mutually
exclusive classes to the assignment of a number of terms or descriptors, each standing for
one of a number of aspects of the subject.  For our purposes, however, we shall regard both the
case of indexing with a number of descriptors and that of classifying to a single category
or subject heading as being within the province of automatic assignment indexing, re-
serving the term "automatic classification" for the case where the machine is used to
establish the classification or categorization scheme itself.

     Actual experiments in automatic assignment indexing by Borko, Borko and Bernick,
Maron, Salton, Stevens and Urban, Swanson, and Williams will be discussed briefly
below.  These discussions are generally in chronological order with respect to first
reporting of results, except that the Salton-Lesk-Storm work reflects a somewhat dif-
ferent principle of assignment from the methods using clue word approaches and it is
therefore described after these others have been discussed.  Some of the similarities and
differences between the various methods are then indicated.  A brief final subsection
covers related assignment indexing proposals for which experimental data is not available
or has not as yet been reported in the literature.


4.1  Swanson and Later Work at Thompson Ramo-Wooldridge

     Research on fully automatic indexing as well as on full text searching and retrieval
at the Ramo-Wooldridge Corporation has been reported as being under way at least as
early as the spring of 1958. 1/ As described elsewhere in this report, experiments in
search and retrieval based upon full natural language text had used as test items short
articles in the field of nuclear physics. In additional experiments representing a
preliminary "clue word" approach to possibilities for automatic indexing procedures,
some of this same material was used.


1/  National Science Foundation's CR&D report No. 2 [430], p. 32.

    In these additional experiments, 27 articles in the nuclear physics subject area were
included in a corpus of 100 articles, the remainder covering a variety of topics.  Fre-
quency counts of word occurrences for the physics material were obtained and the 12 most
frequent words that were judged to be discriminatory for the subject were selected.  The
hypothesis was then tested, that if any document pertained to nuclear physics it would
contain at least two of these words.   Retrieval was achieved for 25 of the 27 documents
and the two "irrelevant" documents also retrieved did include information at least
peripherally related to the subject.  It was thus evident that the retrieval effectiveness of
automatic recognition of nuclear physics subject material in the general collection was
considerably greater than the average effectiveness of retrieving responses to the highly
specific search questions in nuclear physics that had been used in the full text searching
experiments (Swanson, 1961 [586]).
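
    The two-of-twelve hypothesis described above amounts to a simple threshold test.
The following is a minimal sketch only: the clue words shown are hypothetical (the source
does not list the twelve actually selected), and counting distinct words rather than
repeated occurrences is an assumption.

```python
import re

def looks_like_subject(text, clue_words, min_hits=2):
    """Return True if at least `min_hits` distinct clue words occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & set(clue_words)) >= min_hits

# Hypothetical clue words, for illustration only.
CLUES = {"neutron", "proton", "nucleus", "scattering"}

if __name__ == "__main__":
    print(looks_like_subject("Neutron scattering by a heavy nucleus", CLUES))
    print(looks_like_subject("The proton budget debate in parliament", CLUES))
```

Raising `min_hits` trades missed documents for fewer irrelevant ones, which is the
balance the experiment was measuring.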

    This second set of experiments provided a transition from the full text searching
work, which if it can be considered indexing at all is obviously derivative indexing, to
work in the application of an automatic assignment indexing method to 1,200 newspaper
clippings (Swanson, 1962 [584], 1963 [580]).    These were brief news items for which
machine-readable texts in the form of punched paper tape were available.    Thesaurus-
groups of words likely to be associated with each of 20  to 24 subject headings were first
compiled on the basis of human analysis of 1,000 or more representative items.     These
word groups were further screened so that no word appeared in more than one group and
so that each word retained should be uniquely indicative of the particular subject
category.  In the machine assignment procedure, subsequently, if a word occurs that
belongs to a particular thesaurus group, the corresponding subject heading is assigned
to the item in which that word occurs.
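
    The assignment rule just described can be sketched in a few lines.  The group
contents and heading names below are hypothetical; only the rule itself (each word
belongs to exactly one group, and any occurrence of a group word assigns that group's
heading) is taken from the description above.

```python
import re

# Hypothetical thesaurus groups; each word appears in exactly one group.
GROUPS = {
    "SPORTS": {"inning", "touchdown", "goalkeeper"},
    "FINANCE": {"dividend", "stockholder", "bond"},
}

# Invert the groups so that each clue word points at its single heading.
WORD_TO_HEADING = {word: heading for heading, words in GROUPS.items() for word in words}

def assign_headings(text):
    """Assign every heading whose thesaurus group has a word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({WORD_TO_HEADING[t] for t in tokens if t in WORD_TO_HEADING})

if __name__ == "__main__":
    print(assign_headings("Stockholder meeting follows touchdown record"))
```

Because a single matching word suffices, an item may receive several headings, which
is consistent with the reported average of about four categories per item.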

    Results achieved with this technique appear to be highly promising, at least for this
type of material. Swanson reports as follows:

    "Approximately 1,200 brief news items were classified into 20 nonhierarchical
    subject categories, both by a human and a machine procedure.      Each item was
    assigned on the average to about four categories.   The results of the two
    processes were compared.       With the human process as a standard, the machine
    missed only seven percent of the correct subject assignments and made a number
    of irrelevant assignments equal to about 17 percent of the total.  Nearly 40 per-
    cent of the automatic subject assignments judged finally to be correct were
    missed by the human catalogers." 1/

    While this accomplishment is actually due to the extensive human effort devoted to
compiling, organizing, and pruning the uniquely indicative word lists, it is pointed out
that this intellectual effort and the programming tasks need to be done only "once and
for all". 2/
It is further pointed out that garbles or misspellings in the input text do not appear to
affect the procedure, there being enough redundancy in the messages so that even if one or
two clue words are missed, others will be present. 3/


1/  Swanson, 1962 [584], p. 468.

2/  Ibid, p. 469.

3/  Swanson, 1963 [580], p. 5.



    Swanson and his TRW associates have further proposed extensions of the prespecified
unique clue-word technique.   For example, it is suggested that machine processes of
comparing words of titles, subtitles and chapter headings to lists of possible subject
headings can be extended in sophistication by machine lookups of synonym groups and of
characteristic subject-word associations. 1/ Frequency weightings may be taken into
account, and similar measures of association and subject-indicativeness may be
developed for phrases as well as for individual words. 2/   In general, however, the
apparent success of this clue-word technique in tests to date should be considered in the
light of the special character of the items, their extreme brevity, and the high probability
that the fact-word incidence involved in news reporting is not typical of less popular and
less factually oriented materials. 3/

    Continuing work along similar lines has been carried forward at Ramo-Wooldridge in
the "Word Correlation and Automatic Indexing Program" sponsored by the Council on
Library Resources (1959 [490] and [491]).    Here, the objectives are to develop and apply
clue-word techniques to material that is much more representative of the scientific and
technical literature.   The thesaurus-groups, now called "indexonym" groups, are made up
of words and phrases selected by extensive human analysis as being significantly
"useful-for-retrieval-purposes".

    New items would be processed in a word and phrase lookup operation, with each word
or phrase being initially assigned the identifier number codes of all groups to which it
belongs.   However, unless a particular group's number is repeated several times within
the space of a few paragraphs, it is not used as the basis for the actual assignment of an
index tag.  Provision would be made for calling human attention to items having a number
of words that are not deleted by processing against a "useless-for-retrieval purposes"
list, but that are not found in any of the "accepted" groups.  It is suggested that in this way it
should be possible to "ascribe measures of automatically recognizable 'newness' to
technical articles". 4/
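
    The lookup-and-threshold procedure outlined above might be sketched as follows.
This is an illustration under stated assumptions, not the Ramo-Wooldridge program: the
function name, the threshold of three repetitions, and the treatment of the whole item
as a single window (rather than "a few paragraphs") are all hypothetical choices.

```python
from collections import Counter

def tag_item(tokens, word_groups, useless, min_repeats=3):
    """Return (group numbers to assign, words found in no list).

    word_groups maps a word to the set of group numbers it belongs to.
    A group number becomes an index tag only if it is hit at least
    `min_repeats` times (assumed threshold).
    """
    counts = Counter()
    unknown = []
    for tok in tokens:
        if tok in useless:
            continue  # deleted by the useless-for-retrieval list
        groups = word_groups.get(tok)
        if groups is None:
            unknown.append(tok)  # candidate evidence of "newness"
        else:
            counts.update(groups)  # one hit for each group the word belongs to
    tags = sorted(g for g, c in counts.items() if c >= min_repeats)
    return tags, unknown
```

An item with a long `unknown` list would be the kind referred to human attention.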

4.2 Maron's Automatic Indexing Experiments

    By April of 1959, the reports of work at Thompson Ramo-Wooldridge on automatic
indexing and related problems submitted for the Current Research and Development in
Scientific Documentation series included reference to Maron and a "probabilistic model for
the assignment of index tags", as well as to Swanson's continuing projects. 5/


1/  Swanson, 1962 [584], p. 469.

2/  Swanson, 1963 [580], pp. 1-2.

3/  See also Mooers, 1963 [424].

4/  Thompson Ramo Wooldridge, 1959 [491], p. 2A.

5/  National Science Foundation's CR&D report No. 5 [430], p. 34.



    In addition to his work on probabilistic indexing with emphasis on relevance
weightings for index tags manually assigned, Maron has actively explored automatic
assignment indexing techniques.  The approach is also probabilistic, with emphasis on
the statistics of association between content-indicative clue words and subject headings
manually assigned to sample documents.    The experimental corpus consisted of a group
of abstracts in the field of computer technology indexed to 32 subject categories designed
for the purposes of these investigations.


    Common words such as articles and prepositions were first excluded.     Next, words
occurring less than three times were purged and words such as "data" and "computer"
were also rejected because they occur so frequently in this literature.  Approximately
1,000 words remained after these purging operations.  After sorting the source docu-
ments to their most appropriate subject categories, statistical frequencies were
obtained for the co-occurrences of the candidate clue-words with the categories and the
resulting listings were manually examined to determine which words peaked in a
particular category.   Eventually, 90 such words were selected.
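
    The mechanical part of this purging can be sketched as below.  The final choice of
the 90 words that "peaked" in a category was made by manual inspection and is not
reproduced here; the parameter names and the explicit `too_frequent` list are
assumptions made for illustration.

```python
from collections import Counter

def candidate_clue_words(docs, common_words, too_frequent, min_count=3):
    """Keep words that survive the three purging steps described above.

    docs          -- list of token lists, one per source document
    common_words  -- articles, prepositions, and the like
    too_frequent  -- words rejected as too pervasive (e.g. "data", "computer")
    """
    counts = Counter(word for doc in docs for word in doc)
    return sorted(
        word
        for word, n in counts.items()
        if n >= min_count and word not in common_words and word not in too_frequent
    )
```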

    The occurrence of one or more of the 90 clue-words in the text of new documents was
then used to predict the subject category to which the new item should belong. 1/   Tests
were run with two groups of documents, one consisting of the source items from which
the statistical frequency and word list data had been obtained, and the second group
consisting of 145 genuinely new items.  For the latter group, twenty documents contained
no clue words whatever and forty items had only one.  For the remaining 85 items having
two or more clue words, the results of the computer assignment program were
predictions of the correct category in 44, or 51.8 percent, of the cases. 2/ Results using the
source documents were significantly better, as expected, with 84.6 percent accuracy of
category prediction for 247 items.  Results were also related to the number of clue words
that occurred in the test items, with a prediction accuracy of only 48.7 percent for items
with a single clue word rising to 100 percent probability of correct assignment if six or
more clue words occurred.
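
    A probabilistic assignment of this general kind can be illustrated with a small
Bayesian classifier.  The sketch below is a generic naive-Bayes formulation with
add-one smoothing, chosen for illustration; it should not be read as Maron's exact
formula, which is not reproduced in this report.  All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (clue_word_list, category) pairs from the source items."""
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in docs:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return cat_counts, word_counts, vocab

def predict(words, model):
    """Return the most probable category, or None if no clue word occurs."""
    cat_counts, word_counts, vocab = model
    known = [w for w in words if w in vocab]
    if not known:
        return None  # items without clue words cannot be assigned
    total_docs = sum(cat_counts.values())
    best, best_score = None, -math.inf
    for cat in sorted(cat_counts):
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / total_docs)  # prior
        for w in known:
            # smoothed estimate of P(word | category); smoothing is an assumption
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best
```

The `None` branch mirrors the twenty new items that contained no clue words and so
could not be assigned at all.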

    Trachtenberg (1963 [608]) has also considered a probabilistic approach to automatic
indexing and categorization of documents, similar to that of Maron.    He suggests the
investigation of two information theoretic measures with reference to determination of
which of various possible clue words are significantly discriminating with respect to the
different categories.  He further suggests experiments using 90 clue words and the
corpus used by both Maron and Borko, but no actual results have as yet been reported.


4.3 Automatic Indexing Investigations of Borko and Bernick

    At the System Development Corporation, the work of Borko (1960 [73]), and of
Borko and Bernick (1962 [77], 1963 [78], 1964 [79]) in the area of automatic indexing
has involved both automatic assignment indexing and automatic classification techniques.
They have not only reported actual indexing results but have provided data for the inter-
comparison of their techniques with the experiments of Maron for the same source
material.


1/  Note that the word itself is not necessarily used as an index tag or label, as is the
    case for derivative indexing using an inclusion list approach.  This is an important
    distinction.

2/  Maron, 1961 [395], p. 257.



     The original Borko approach was based on the principles of factor analysis as these
had been developed for the analysis of multivariate data, especially in the field of
psychology.  Borko's first experiments were directed to a corpus consisting of 618
abstracts in the field of psychology, amounting to approximately 50,000 words of total
text and 6,800 different words.  These words were sorted by computer program into an
order reflecting their respective frequencies of occurrence.  From the approximately 200
words that occurred twenty or more times in this corpus, the investigator himself
selected 90 words to serve as index (or, better, index-clue) terms.  A matrix was then
developed for the frequencies of co-occurrence of these words and the documents in which
they appeared.   From this, a 90 x 90 correlation matrix was computed as follows:

     "To compute the correlation coefficient . . . we used the following formula

     \[ r_{xy} = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{[N \sum x^2 - (\sum x)^2]\,[N \sum y^2 - (\sum y)^2]}} \]

     Where N is equal to the number of documents (618) and x and y are the terms being
     correlated." 1/
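
     The product-moment correlation quoted above can be computed directly from two
term-frequency vectors, one entry per document.  The function below is a minimal
sketch; its name is hypothetical, and it assumes that neither term has the same
frequency in every document (otherwise the denominator is zero).

```python
import math

def term_correlation(x, y):
    """Pearson correlation of two terms' per-document frequencies."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (n * sxy - sx * sy) / denom
```

Applying this to every pair of the 90 selected terms yields the 90 x 90 matrix that is
then factor analyzed.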

     The term-correlation matrix was then factor analyzed and the first ten eigenvectors
were selected as factors to be rotated and interpreted.  Borko emphasizes that:

     "The interpretation must be made by the investigator and is based upon his knowledge
     of the analytic procedures and the subject matter.  There is, therefore, a degree of
     subjectivity in the names selected for each factor. These names may be regarded
     as hypotheses about the factor meaning." 2/

     Following the derivation of these "classification categories" by means of the factor
analysis technique, new items may be assigned to the categories on the basis of words
occurring in their texts (abstracts) in accordance with the following procedural steps:

     "1. Each document, in machine readable form, is analyzed by the computer.
     A list of the index terms and their frequencies of occurrence in each document
     is recorded.

     "2. The category or categories containing the index term is assigned a value equal
     to the product of the number of occurrences of the word in the abstract and the
     normalized factor loading of the word in the category.  If more than one index term
     appears in a category, the products are summed.

     "3. After each index term has been considered, the category having the highest
     numerical value is selected." 3/
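
     The three steps quoted above can be sketched as follows.  The factor loadings and
category names in the usage below would be hypothetical; in the experiments they came
from the rotated factor analysis.  Returning no category when no index term occurs
is an added assumption.

```python
from collections import Counter

def classify(tokens, loadings):
    """Score each category as sum(term frequency x normalized factor loading).

    loadings: {category: {index_term: normalized_factor_loading}}
    Returns (best category or None, all scores).
    """
    freq = Counter(tokens)  # step 1: frequencies of the index terms
    scores = {
        cat: sum(freq[term] * weight for term, weight in terms.items())  # step 2
        for cat, terms in loadings.items()
    }
    best = max(scores, key=scores.get)  # step 3: highest value wins
    return (best if scores[best] > 0 else None), scores
```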


1/  Borko, 1961 [73], p. 283.

2/  Ibid, pp. 285-286.

3/  Borko and Bernick, 1962 [77], pp. 7-8.



    The choice of 90 clue words in Borko's work with abstracts in the field of psycho-
logical literature was apparently dictated by a matrix size which would be convenient
for computer manipulation. 1/ However, it happened to coincide with the number of clue
words used by Maron in his experiments.    Advantage was taken of this coincidence to
obtain comparative data on the performance of the two assignment-indexing techniques
as applied to the same material.  The 260 computer literature abstracts used by Maron
as source documents were processed to derive a correlation matrix for Maron's 90
manually selected words, which was then factor analyzed.   Several sets of factors were
extracted, rotated, and the results studied, with a final selection of 21 categories.

    Since these automatically derived categories did not coincide with Maron's original
32, it was necessary to analyze manually the total group of 405 abstracts (260 "source"
and 145 "test" items) and assign them to the new categories, then to study the documents
falling into each factor-analytically derived category to determine which of Maron's 90
clue words were category-indicative, and finally to substitute these words in the Bayesian
equation used by Maron so as to predict which of these classification categories his
probabilistic method should obtain.

    The same two sets of 260 "source" and 145 "new" abstracts used by Maron were then
submitted to the computer assignment program which compares the clue words of a new
item with the numeric values of the predictor words for each factor category, then com-
putes the score for each item in all categories, and assigns the category with the highest
score to the item.  For the source items, Borko and Bernick's results showed 63.4
percent correctly classified, by comparison with the 84.6 percent correctness score
originally obtained for them in Maron's experiments.  For the new items the factor
analysis method scored 48.9 percent correct assignment by comparison with Maron's
original 51.8 percent. 2/ The later investigators therefore concede that the performance
of Maron's technique was somewhat superior for the same items using the clue words
originally selected by Maron.

    Further experimentation was then carried out (Borko and Bernick, 1963 [78]) using
word frequency data for the selection of a new set of 90 clue words and a classification
scheme for 21 categories was again automatically derived.   The 405 abstracts were again
manually classified to these machine-derived categories by five subject-matter
specialists and the two investigators.  Comparative data were then obtained for both the
Maron assignment formula and the modified classification system assignments in terms
of agreement with the manual assignments.

    For the source items, the percentage of machine assignments agreeing with those
made by people was 62.7 when the Bayesian probability formula used by Maron was
applied and 61.2 for the factor analysis score system.  For the new items, the
corresponding correct percentages were 57.9 and 55.9.    Additional data compared the
effects of using the original Maron words and the frequency-based word set (Borko's
words) for the same probability formula assignment method.    While there was an overlap
of approximately 50 percent between Maron's words and Borko's words, the findings
indicated that:


1/  Now increased to 150 x 150.

2/  Borko and Bernick, 1962 [77], pp. 9-10.



    ". . . The index words selected by Maron are decidedly specific to the documents
    from which they were derived and are of less generality than the frequency based
    terms.  The Bayesian formula coupled with the Maron words correctly predicted
    the classification of 79.6% of the documents in Group I ['source items'] but only
    45.5% of the documents in Group II ['test items'].  The coupling of the Bayesian
    formula with the Borko words resulted in a slight decrease in the percentage of
    Group I documents whose classification was correctly predicted (62.7%) but
    increased the percentage of correct prediction for Group II documents to 58.0%." 1/

    Other findings from the later experiments indicated that despite the differences in
the two word-sets, the factor categories derived from them were very similar.     It was
also found that, at least for the source items (Group I), the two machine techniques and
the manual process classified 56.1 percent of the items into the same categories.   It
should be noted, however, that in the case of the automatic assignment methods:   "Eleven
documents contained no clue words and could not be automatically classified by either
system." 2/

4.4 Williams' Discriminant Analysis Method

    The work of Williams in automatic assignment indexing, reported in the fall of
1963 [642], has also involved tests on abstracts of the computer literature, directly
comparable to but not necessarily identical with those used by Maron and by Borko and
Bernick.  This work at IBM's Federal Systems Division, Bethesda is based in part on
earlier work by Meadow which involved computer studies of matching functions for
document word lists and category word lists for test items drawn from such fields as
psychology, law, computer abstracts, and news items. 3/ What has subsequently been
developed is termed a "discriminant" method which begins with a hierarchical
classification structure of pre-established subject categories and with a small set of sample
documents previously indexed by people into these categories.  Frequency counts of words
in each of the sample documents lead to computations, for each category, of the theoreti-
cally probable frequencies of its most statistically significant words.  For new items,
observed word frequencies are compared with the theoretical word-category associations
and a relevance value is computed for the item in terms of each category.

    The corpus selected for experimentation consisted of 400 items from "Computer
Abstracts on Cards". 4/ These had previously been indexed using a classification
structure of 15 major categories, each of which is divided in turn into 10 subcategories.
The experimental sample, however, was so selected as to provide exactly 15 "source"
items and 5 "new" items for each of 5 subdivisions of 4 of these major categories.


1/  Borko and Bernick, 1963 [78], p. 23.

2/  Ibid, p. 11.

3/  Williams, 1963 [642], cites H. R. Meadow, "Statistical Analysis and Classification
    of Documents", IRAD Task No. 0353, FSD IBM, Rockville, Maryland, 1962, but
    this is apparently a company-confidential document, containing proprietary
    information.  Meadow gave an informal report on her work at the Computing Center
    seminars, University of Maryland, in March of 1963.

4/  Available on a subscription basis from Cambridge Communications Corporation,
    Cambridge, Mass.

    Discriminant coefficients were then computed at both the major and minor levels for
all words occurring in the sample items falling into one of the 20 groups in accordance
with the formula:

    "The discriminant coefficient is:

    \[ \lambda_i = \sum_{j=1}^{n} \frac{(P_{ij} - \bar{P}_i)^2}{\bar{P}_i} \]

    Where:

    \[ P_{ij} = \frac{f_{ij}}{\sum_{i=1}^{m} f_{ij}} \]

    The relative frequency of the ith word in the jth category,

    and

    \[ \bar{P}_i = \frac{1}{n} \sum_{j=1}^{n} P_{ij} \]

    The mean relative frequency per category of the ith word." 1/

These coefficients are used both to set up threshold values to determine which words
should be used in the assignment formulas and to assign weighting factors to the words
themselves.
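
    A computation of this kind can be sketched as below.  The sketch assumes that the
coefficient for a word is the sum over categories of the squared deviation of its
relative frequency from its mean relative frequency, divided by that mean; this is a
reading of the formula rather than a verified transcription, and the function name is
hypothetical.  Every category is assumed to contain at least one word occurrence.

```python
def discriminant_coefficients(freq):
    """freq: {category: {word: count}} -> {word: coefficient}.

    Assumed form: sum_j (P_ij - mean_i)^2 / mean_i, where P_ij is the
    relative frequency of word i in category j.
    """
    cats = list(freq)
    rel = {}
    for c in cats:
        total = sum(freq[c].values())
        rel[c] = {w: n / total for w, n in freq[c].items()}
    words = {w for c in cats for w in freq[c]}
    coeffs = {}
    for w in words:
        p = [rel[c].get(w, 0.0) for c in cats]
        mean = sum(p) / len(cats)
        coeffs[w] = sum((pj - mean) ** 2 for pj in p) / mean
    return coeffs
```

Under this reading a word spread evenly across categories scores zero, while a word
concentrated in one category scores high, so a threshold on the coefficient removes
words with little power to discriminate.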

    The results of the experiments to date are based on 83 items from the "reference
set" which were not used as source items.  For 63 items, 78 percent were correctly
classified at the level of a single major category (e.g., "Programming", "Hardware
Design") and also correctly classified at a single subcategory level (e.g.,
"Programming Languages", "Semiconductor Devices").     The 20 remaining items were classified
to one major category with an accuracy of 95 percent and to two minor level subdivisions
with accuracies of 60 percent and 75 percent.  Additional investigations were made on
the effects of using a discrimination threshold to eliminate insignificant words from
consideration and on the use of weighting factors in the assignment calculations.

1/  Williams, 1963 [642], p. 163.

4.5 SADSACT

    Stevens and Urban at the National Bureau of Standards (1963 [569, 570]) have also
explored an automatic indexing technique that uses, as in the experiments of Williams,
a teaching sample or reference set of previously indexed items to form patterns of word
and index-term assignment associations.   However, there are much less formal
requirements for computing correlation coefficients and no consideration is required of either
the theoretical probabilities of word occurrence by category or of discrimination
coefficients and thresholds.  Instead, the technique involves ad hoc statistical associations
between the words occurring in the title and in the abstract of a sample item and the
descriptors previously assigned to that item.   A master selection-word vocabulary is
thus built up where each word is listed in terms of the frequencies of its co-occurrence
with each of the descriptors with which it has co-occurred, regardless of whether or not
such prior associations are either relevant or significant.  No attempt has as yet been
made to "purge" the resulting association lists.  Instead, reliance is placed on the
patterns of multiple word usage and of redundancy of words used in titles and cited titles
of new items to minimize the effects of irrelevant or accidental prior word-descriptor
associations and to enhance the significant ones.

     The SADSACT method (for "Self Assigned Descriptors from Self and Cited Titles")
proceeds with the assumption, which it shares with the arguments for citation indexing
previously discussed, that the literature references cited by an author are indicative of
the subject content or contents of his paper. 1/ For the automatic indexing of new items,
their titles and the titles of up to ten bibliographic references cited are keystroked, con-
verted to punched cards, and fed to the computer.   This input material is run against the
master vocabulary to obtain for each input word which matches a vocabulary word a
"descriptor-selection score" for each of the descriptors previously associated with that
word.  These scores are summed up for all words and at an appropriate cutting level
those descriptors having the highest scores are assigned to the new item.
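
     The vocabulary-building and scoring steps can be sketched as follows.  The actual
"descriptor-selection score" formula is not given here, so the sketch assumes the
simplest choice, raw co-occurrence counts; the function names, the fixed `top_n`
cutting level, and the sample descriptors are likewise hypothetical.

```python
from collections import Counter, defaultdict

def build_vocabulary(samples):
    """samples: list of (words, descriptors) for previously indexed items.

    Returns word -> Counter of descriptors it has co-occurred with.
    """
    vocab = defaultdict(Counter)
    for words, descriptors in samples:
        for w in set(words):
            vocab[w].update(descriptors)
    return vocab

def assign_descriptors(words, vocab, top_n=2):
    """Sum per-word descriptor counts over the title and cited-title words."""
    scores = Counter()
    for w in words:
        if w in vocab:
            scores.update(vocab[w])  # repeated words contribute repeatedly
    return [d for d, _ in scores.most_common(top_n)]
```

Because repeated input words add their counts again, redundancy across the title and
cited titles reinforces the descriptors they share, which is the effect the method
relies on in place of purging.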

     Preliminary results based on the titles and cited titles of items that were "source
items" in the sense that their titles and abstracts had been used in the teaching sample
were reported at the NATO Advanced Study Institute on Automatic Document Analysis
held in Venice in July, 1963.  For 30 items drawn from such subject fields as computer
technology, information selection and retrieval, mathematical logic, pattern recognition,
and operations research, all of which had previously been indexed by ASTIA personnel in
1960, the machine assigned 64.8 percent of the descriptors previously assigned.  Sub-
sequent tests on genuinely new items, however, resulted in a drop to only 48.2 percent
"hit" accuracy.

     These "new" item results were also evaluated by having several representative
users of the collection analyze the test items and assign descriptors to them from a list
of the descriptors available to the machine.  The extent to which the descriptors assigned
by machine were also independently chosen by one or more of these indexers was then
checked.   In general, the fewer descriptors assigned by the machine, the better was the
human agreement, ranging from 47.4 percent overall in the case where the machine had
assigned twelve descriptors to each item to 76 percent agreement where the machine assigned
only one.  In particular, for ten items which were analyzed by five different indexers,
the chances that one or more would also select the machine's first choice (highest scoring)
descriptor averaged 90 percent.

1/  If necessary or desirable, however, abstracts or portions of text can be used in
    addition to or in lieu of the cited titles.

4.6 Assignment Indexing from Citation Data

     Certain phases in the program of investigation of information selection and retrieval
problems at the Harvard Computation Laboratory have been mentioned previously.      The
work of Storm and of Lesk and Storm on the use of first-noun-occurrences as selection
clues for both automatic indexing and abstracting was discussed in connection with
techniques for improved derivative indexing.    The studies on citation indexing have included,
as noted, experiments to assign indexing terms to a new document by finding the indexing
terms previously assigned to the five most "related" documents, where "relatedness" is
a function of the similarity in citation patterns as between the new document and items
already in the collection.  The results of such index term assignments are reported as
identical to those made by human judgment approximately 50 percent of the time. 1/

    More specifically, in an experiment using documents drawn from a small collection
in the fields of mathematical linguistics and machine translation, a new item was com-
pared in terms of its citation data with the citation similarity data previously determined
for earlier documents, and the set of five related documents was selected using the
magnitude of the row similarity coefficients obtained from links of length one and two.
All index terms occurring at least twice in the set of terms assigned to these related
items were then assigned to the new items.   For the ten "typical" new item cases, for
which comparative data are shown, the citation data assignment method correctly
assigned, on average, 47.6 percent of the terms assigned manually to the same items. 2/
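
    The final voting step can be sketched as below.  The sketch assumes the five most
related documents have already been found from the citation similarity data (that
computation is not reproduced here) and that each document counts a term at most
once; the function name and sample terms are hypothetical.

```python
from collections import Counter

def assign_terms(related_term_sets, min_votes=2):
    """Assign every index term carried by at least `min_votes` related documents."""
    votes = Counter(term for terms in related_term_sets for term in set(terms))
    return sorted(term for term, n in votes.items() if n >= min_votes)

if __name__ == "__main__":
    related = [{"syntax", "translation"}, {"translation", "dictionary"},
               {"syntax"}, {"grammar"}, {"translation"}]
    print(assign_terms(related))
```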

    A slightly more sophisticated indexing term assignment formula, described by Lesk,
was applied to additional test cases, but "failed to raise accuracy above fifty percent". 3/
For five typical new cases, the improved method correctly assigned 11 of the 20 terms
manually assigned to these items, or an average accuracy of 55.5 percent. 4/

4.7 Similarities and Distinctions among Assignment Indexing Experiments

    In Table 2 some of the key points of the various automatic assignment indexing
experiments we have discussed above are summarized.     Certain similarities, distinctions,
and differences are to be noted.  Borko and Bernick use the same corpus as did Maron
and also re-apply Maron's formula to a different clue-word set for the same material.
Williams uses material similar to the Maron-Borko computer corpus.   The SADSACT
tests also use some items that might be included in the Maron-Borko and Williams
corpora.  The Swanson experiments with newspaper clippings represent a quite different
class of material consisting of brief, terse, factual messages.


1/  Lesk, 1963 [357], p. V-8.

2/  Salton, 1962 [520], p. III-41, Table 9.

3/  Lesk, 1963 [357], p. V-7.

4/  Ibid, p. V-8, Table 3.



Table 2.  Summary of Automatic Assignment Indexing Test Evaluations

Investigator:            Maron

Principles and Methods:  Statistical probabilities of association between clue words
                         and pre-established subject categories.  Source items manually
                         indexed to 32 categories.  A subclass of words occurring in the
                         corpus selected as clue words, and statistical correlations
                         obtained for 90 such words with categories assigned.
                         Correlation data and Bayesian probabilities used to assign
                         categories to new items.

Materials Used:          Corpus of 405 items selected from computer abstracts, PGEC,
                         1959.  Full text, 20,000 words of which 3,263 were different
                         words.

Tests:                   For 260 source items, 12 did not contain any clue words, 247
                         were indexed, 1 contained an error preventing processing.  For
                         the 247 source items indexed, probability of top-ranked
                         category being correct = 84.6%.  For 145 new items, 20 not
                         indexed because they contained no clue words.  In 85 cases
                         where at least 2 clue words occurred, probability of correct
                         category assignment = 51.8%.

Remarks:

                                                                                           Considerable ma~
                                                                                           inspection and jud
                                                                                           ment involved in
                                                                                           selection of clue
                                                                                           words.  Some ne~
                                                                                           items cannot be p
                                                                                           cessed, because I
                                                                                           contain no clue w(

   3orko        Factor analysis to determine   Psychological     Factors selected were judged         Some new items C
                distinctive grouping of clue   abstracts.   618  to be compatible with but not        not be processed,
                words. Word frequency          abstracts,        identical to subject classif-        because they cont
                counts made, 90 of the 2.0     50, 000 text      ication terms used for these         no clue words.
                most frequent non-common       words; 6, 800     items by the American
                words manually selected.       different         Psychological Association.
                Correlation matrix com-        words.
                puted, factors rotated and
   _____________ interpreted.                 _________________ ________________________________________________________

                                        Table 2 (cont.)

Investigator:            Borko and Bernick

Principles and Methods:  Factor analysis to determine distinctive groupings of clue words.
                         Maron's 90 clue words used for word-word correlation and factor
                         analysis.  21 factors developed, and items manually re-indexed to
                         these categories.

Materials Used:          Same corpus as Maron, 405 computer abstracts, of which 260 used to
                         establish factors, 145 as new items.

Tests:                   Detailed comparison with Maron's technique.  For the source items,
                         63.4% were correctly classified.  For the new items, 46.5% correctly
                         indexed, and 48.9% were correct for those items in which 2 or more
                         clue words occurred.

Remarks:                 Some items cannot be processed because they contain no clue words.


Investigator:            Swanson

Principles and Methods:  Text word lookup against clue word lists, constructed by careful
                         analysis of sample items to be exclusively indicative of a
                         particular subject heading.  Machine assigns a subject heading to an
                         item if any word on its list occurs in that item.

Materials Used:          Brief news dispatches available on teletype tape, wide diversity of
                         topics.  From study of several thousand items, 24 subject headings
                         established and word lists selected, averaging approximately one
                         hundred per category.  775 new items then tested.

Tests:                   Machine assignments compared to manual subject indexing.  For a
                         first batch of 500 items, 569 assignments of correct headings, 119
                         assignments of irrelevant headings, and 32 correct headings missed.
                         The clue word thesaurus was then revised.  For 275 additional test
                         items, results showed 282 correct assignments, 29 irrelevant
                         assignments, 1 missed.  For total, averages of 17% irrelevant
                         assignments, 3% missed.  For 200 items, machine and manual
                         assignments were compared with respect to 5 of the subject
                         categories, with the following results:

                                         Man     Machine
                         Irrelevant        4        25
                         Missed           46         4
                         Correct          75       116

                                        Table 2 (cont.)

Investigator:            Stevens and Urban

Principles and Methods:  Teaching sample for machine compilation of co-occurrence data for
                         words in titles and abstracts with descriptors assigned to these
                         items.  Words in titles and cited titles of new items then run
                         against master list of previous word-descriptor association to
                         derive descriptor-selection scores; highest scoring descriptors
                         (e.g., up to 12) assigned.  Associations derived for 1,600 words
                         co-occurring with any of 70 descriptors previously assigned.

Materials Used:          Two teaching samples, approximately 100 items each with 70%
                         overlap, drawn from items indexed by ASTIA.  For new items, titles
                         and up to 10 cited titles.

Tests:                   For 59 test items, assignments of descriptors that had occurred for
                         at least 3% of the sample items agreed with ASTIA assignments
                         58.1%.  However, for all descriptors assigned by ASTIA, many not
                         available to machine, overall machine accuracy = 40.1%.  For 20
                         items, independently evaluated by several typical users, the
                         chances that one or more people would agree with the machine
                         assignments ranged from 47.1% when 12 descriptors were assigned to
                         75.0% average agreement with the machine's first choice.

Remarks:                 All test items [...] processed and [...] different descriptors
                         assigned to each, [...] some descriptors [...] in manual indexing
                         [...] these items are not available to the machine.


Investigator:            Williams

Principles and Methods:  Discriminant analysis.  Sample items previously indexed to a
                         2-level classification system were subjected to word frequency
                         counts and the theoretical frequencies of the most significant
                         words in each category were compiled.  For new items, observed word
                         frequencies compared with theoretical frequencies for each
                         category, highest scoring assigned.

Materials Used:          Items from "Computer Abstracts on Cards" indexed to 15 major
                         categories each divided into 10 minor categories.  300 abstracts
                         selected to provide equal distribution to 20 sub-categories, 5 each
                         in 4 major categories.  Additional items for test similarly
                         selected.

Tests:                   For 63 new items assigned by machine to 1 major and 1 minor
                         category, 78% correct at major level, 64% correct at minor level.
                         For 20 items classified to 1 major and 2 minor categories, 95%
                         correct at major level, 60% and 75% correct at the minor level.
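
    As a rough illustration of the kind of clue-word probability scheme summarized for
Maron in Table 2, the following Python sketch ranks categories for a new item from
clue-word counts in previously indexed items.  It is a minimal naive-Bayes-style sketch
and should not be read as Maron's actual formula: the function names, the add-one
smoothing, and the toy data are all assumptions introduced here for illustration.

```python
import math
from collections import Counter, defaultdict


def train(items, clue_words):
    """Count clue words per category.

    items: list of (text, category) pairs that were indexed by hand.
    clue_words: set of words selected (manually) as clue words.
    """
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, cat in items:
        cat_counts[cat] += 1
        for w in text.lower().split():
            if w in clue_words:
                word_counts[cat][w] += 1
    return cat_counts, word_counts


def rank_categories(text, cat_counts, word_counts, clue_words):
    """Return categories ordered from most to least probable for a new item."""
    clues = [w for w in text.lower().split() if w in clue_words]
    if not clues:
        return []  # no clue words: the item cannot be processed
    total_items = sum(cat_counts.values())
    scores = {}
    for cat, n in cat_counts.items():
        # Add-one smoothing (an assumption) keeps unseen words from zeroing a category.
        denom = sum(word_counts[cat].values()) + len(clue_words)
        score = math.log(n / total_items)
        for w in clues:
            score += math.log((word_counts[cat][w] + 1) / denom)
        scores[cat] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only.
source_items = [
    ("magnetic core memory", "storage"),
    ("core memory drum", "storage"),
    ("logic circuit design", "logic"),
    ("logic circuit synthesis", "logic"),
]
clue_set = {"core", "memory", "logic", "circuit"}
category_counts, clue_counts = train(source_items, clue_set)
print(rank_categories("a new core memory", category_counts, clue_counts, clue_set))
```

An item with no clue words yields an empty ranking, which mirrors the remark in Table 2
that some new items cannot be processed at all.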

    None of the experiments has so far encompassed testing of anything but very small
test item samples and the dangers of extrapolating from so small and so specialized
bodies of data should be clearly recognized.  Mooers identifies these dangers in terms of

    "The Silent Postulate:

    That    (real people)
            (real documents)
            (real jobs to do)
    can somehow be eliminated from the experimental study, and that
            (role-playing people)
            (substitute documents)
            (imaginary jobs)
    can be substituted and still give valid experimental results." 1/

    In most of the experiments in automatic indexing conducted to date, indexing and
classification schedules have been especially designed, or evaluations made, specifically
for the purposes of these tests.  Williams, however, stresses the point that the material
used in his experiments had been "classified by professional indexers for the purposes of
actual retrieval." 2/ A similar claim can be made for SADSACT, as noted by Mooers. 3/
Swanson's news item work also obviously relates to real items and implies a real job
to be done, but is directed, as noted, to a class of material not generally comparable to
that found in documentation operations on scientific and technical literature.

    In contrast with the treatment of each document as a self-contained entity without
reference to any other documents, as is the case for derivative indexing, all of the
automatic assignment indexing experiments, by virtue of the fact that they are assign-
ment techniques, do to some extent embody the effects of a consensus of a particular
collection, or a consensus of prior indexing, or a consensus of human subject content
analysis applied to sample documents, or some combination of these effects.    The SAD-
SACT method, in addition, wherever cited titles are available for new items, takes
advantage of terminology other than the author's own as a source of clue words.  Other
proposed methods of assignment indexing, such as the use by Salton, Lesk, and Storm of
citation-pattern similarity data, would carry the latter principle even further.
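
    The way an assignment technique carries forward a consensus of prior indexing can
be sketched in a few lines of Python, loosely following the SADSACT description in
Table 2: word-descriptor co-occurrences are compiled from a teaching sample, and the
words of a new item's title and cited titles are then scored against them.  This is a
minimal sketch only.  The scoring rule (a plain sum of co-occurrence counts), the
function names, and the sample data are assumptions and are not taken from the SADSACT
programs themselves.

```python
from collections import Counter, defaultdict


def build_associations(teaching_sample):
    """Compile word-descriptor co-occurrence counts.

    teaching_sample: list of (text, descriptors) pairs, where text is the
    title-plus-abstract text and descriptors were assigned by human indexers.
    """
    assoc = defaultdict(Counter)
    for text, descriptors in teaching_sample:
        for word in set(text.lower().split()):
            for d in descriptors:
                assoc[word][d] += 1
    return assoc


def assign_descriptors(title, cited_titles, assoc, max_descriptors=12):
    """Score descriptors from the words of the title and any cited titles."""
    scores = Counter()
    for text in [title, *cited_titles]:
        for word in text.lower().split():
            if word in assoc:  # membership test does not create empty entries
                scores.update(assoc[word])
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [d for d, _ in ranked[:max_descriptors]]


# Invented teaching sample, for illustration only.
sample = [
    ("digital computer programming", ["COMPUTERS", "PROGRAMMING"]),
    ("computer storage devices", ["COMPUTERS", "MEMORY DEVICES"]),
]
associations = build_associations(sample)
print(assign_descriptors("programming languages", ["a computer compiler"], associations))
```

Note that the word "computer" reaches the scoring step only through the cited title,
which is the sense in which terminology other than the author's own serves as a source
of clue words.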


1/  Mooers, 1963 [424], p. 5.
2/  Williams, 1963 [642], p. 162.
3/  Ibid., p. 5.



4.8   Other Assignment Indexing Proposals

      A few additional automatic assignment indexing proposals are under development.
Examples for which experimental data are not as yet generally available include
work at EURATOM, some preliminary experiments at Chemical Abstracts
Service, work at General Electric, Bethesda, the proposed "Multilindex" system of
Information Systems, Inc., investigations by Slamecka and Zunde, and a special purpose
development project at Goodyear Aerospace.

      Meyer-Uhlenried and Lustig report for the EURATOM developments as follows:

      "... Procedures are being developed which allow based upon given keyword
      lists first for abstracts: (a) to assign significant keywords and (b) based
      upon hierarchically organized keyword lists, to assign the documents in
      question to specific subject fields.

      "Experiments were made at first on narrow fields with so-called micro-
      thesauri, they showed encouraging results when automatic and manual assign-
      ment were compared.  Positive results depend of course on the quality of the
      abstracts and the significance of the words employed in them.  It remains to
      see how far this favorable prognosis is confirmed by keyword collections of
      more complex contents." 1/

      Friedman and Dyson (1961 [203]) have reported on manual experiments designed to
relate words occurring in a sample of abstracts from a particular section of Chemical
Abstracts to the title or heading for that section.  Significant words in these abstracts
were counted and the number of occurrences as well as the number of different abstracts
in which they appeared were determined, with a rank order listing as a result.  It
appeared, from inspection, that it should be feasible to develop, for each CA section, a
relatively small vocabulary of words that would be descriptive, and indicative of, the
subject matter contained in it.  They conclude:    "In our opinion, the results were signifi-
cant, the small vocabulary of words did select a large percentage of the abstracts in the
section it was based on." 2/
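
    The counting procedure described above was carried out manually, but its logic is
simple enough to sketch in Python: tally each significant word's total occurrences and
the number of different abstracts containing it, list the words in rank order, and then
check what fraction of the abstracts a small vocabulary would select.  This is a sketch
under stated assumptions.  The stop-word list, the tie-breaking rule, the function names,
and the sample abstracts are hypothetical and do not come from Friedman and Dyson.

```python
from collections import Counter

# Illustrative stop-word list (an assumption); everything else counts as "significant".
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was", "were"}


def rank_words(abstracts):
    """Return (word, occurrences, abstracts containing it) in rank order."""
    occurrences = Counter()
    abstract_counts = Counter()
    for abstract in abstracts:
        # Simple whitespace tokenization; punctuation is not stripped here.
        words = [w for w in abstract.lower().split() if w not in STOP_WORDS]
        occurrences.update(words)
        abstract_counts.update(set(words))
    return sorted(
        ((w, occurrences[w], abstract_counts[w]) for w in occurrences),
        key=lambda row: (-row[1], -row[2], row[0]),
    )


def coverage(abstracts, vocabulary):
    """Fraction of abstracts that contain at least one vocabulary word."""
    hits = sum(1 for a in abstracts if vocabulary & set(a.lower().split()))
    return hits / len(abstracts)


# Invented abstracts, for illustration only.
section = [
    "oxidation of the steel alloy",
    "steel corrosion and steel oxidation",
    "polymer synthesis",
]
for word, n_occurrences, n_abstracts in rank_words(section)[:2]:
    print(word, n_occurrences, n_abstracts)
print(coverage(section, {"steel"}))
```

Here `coverage` corresponds to the authors' observation that a small vocabulary "did
select a large percentage of the abstracts in the section it was based on."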

      A project at Information Systems Operations, General Electric, on possibilities
for automatic indexing and abstracting of text has been reported in the November 1962
issue of Current Research and Development. 3/ The META project (Methods of Extracting
Text Automatically) is said to be concerned with the use of statistical, linguistic, and
semantic criteria for analysis and selection of significant words and significant sentences
from text.  Computer programs are being developed in modular fashion for the GE-225
computer.

1/  Meyer-Uhlenried and Lustig, 1963 [417], p. 229.
2/  Friedman and Dyson, 1961 [203], p. 10.
3/  National Science Foundation's CR&D report, No. 11 [430], p. 97.



      The proposed "Multilindex" system is also based on micro-thesauri or small
vocabularies designed, by human analysis, for clue-indications to a relatively narrow
subject field, together with potential syntactic-semantic role indications built into the
dictionary, again by extensive human analysis, following the approaches previously taken
by A. L. (Lukjanow) Loewenthal in her suggestions for solutions to problems of mechanized
translation.  An unpublished proposal-type brochure describing the system was
available as of December 1963. 1/ As of that date, also, demonstration printouts were
available from an IBM 1401 Fortran program, illustrating an index compiled from
abstract-text input and a 1,200-word dictionary for documents in the field of space
antenna tracking radar. 2/ A repertoire of 350 "concepts" or indexing terms was involved,
with an average of 10 assigned to each of 22 test documents, many of these assigned terms
being identical to words occurring in either the title or the text of the abstract of the item.

      Slamecka and Zunde have investigated the extent to which the "notations-of-content"
 in the system developed by Documentation, Inc. for NASA's STAR might be derived by
 machine techniques from the text of the abstracts with enough normalization-standardi-
 zation via inclusion dictionary lookup to qualify as an assignment indexing technique.
 These workers claim:

     "This preliminary investigation indicates the possibility of using the computer
     to index documents adequately for machine retrieval by matching their abstracts
     against an authoritative subject-heading authority ... The inconsistency inherent in
     human indexing can be eliminated as the number of terms derived from any one
     abstract will always be the same.  The abstract and its automatically derived set of
     index terms will always be equivalent ..." 3/
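
     The consistency claim follows directly from the mechanics of dictionary lookup, as
the Python sketch below suggests: every word or phrase in the abstract that appears in an
inclusion dictionary is replaced by its authorized heading, so the same abstract always
yields the same set of terms.  This is a minimal sketch, not the procedure Slamecka and
Zunde actually used.  The dictionary entries, the headings, and the function name are
hypothetical.

```python
import re

# Hypothetical inclusion dictionary: text variant -> authorized subject heading.
INCLUSION_DICTIONARY = {
    "antenna": "ANTENNAS",
    "antennas": "ANTENNAS",
    "radar": "RADAR",
    "tracking radar": "RADAR TRACKING",
}


def derive_index_terms(abstract, dictionary=INCLUSION_DICTIONARY):
    """Return the sorted set of authorized headings matched in the abstract."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    max_len = max((len(key.split()) for key in dictionary), default=0)
    terms = set()
    for i in range(len(words)):
        # Try every phrase of 1..max_len words starting at position i.
        for n in range(1, max_len + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in dictionary:
                terms.add(dictionary[phrase])
    return sorted(terms)


print(derive_index_terms("A tracking radar with two antennas."))
```

Because the lookup involves no judgment, variant forms such as "antenna" and "antennas"
normalize to one heading, and repeated runs on the same abstract cannot disagree.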

     A final example of other approaches to automatic assignment indexing research, not
yet reported in the open literature, is an NIH-sponsored project at Goodyear Aerospace, in
cooperation with the Universities of Minnesota and Rochester and Western Reserve
University, looking toward an automatic classification procedure based on word
co-occurrences for a set consisting of ~00 four-to-five page documents in the field of
diabetes literature.  Programs for statistical analyses of the full text of these
documents, all of which have previously been processed for the manual W.R.U.
"telegraphic" abstracting system, are being developed. 4/

                5.  AUTOMATIC CLASSIFICATION AND CATEGORIZATION

     In all the experimental work, to date, that has been directed toward the use of
computers and other machine-like techniques for the automatic indexing of documents, a


1/  "Description of MULTILINDEX.  A mechanized system for indexing documents,
    storing information, retrieving information", P.S. Shane, Dec. 4, 1963,
    Information Systems, Inc., 7720 Wisconsin Avenue, Bethesda, Maryland.
2/  Private communications, A.L. Loewenthal and P.S. Shane, Dec. 11, 1963.
3/  Slamecka and Zunde, 1963 [561], pp. 139-140.
4/  F. Tuttle, private communication, Oct. 30, 1963.

