MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
"[In the frequency matrix] . . the diagonal elements . . give the total frequency of an
index term and the off-diagonal gives the frequency of co-occurrence of two terms.
The diagonal of the `context' matrix represents that portion of the total vocabulary
with which an individual term has been coordinated, and the off-diagonal the extent
to which two terms have common context. . . Such matrices give a basis for examining
the extent to which terms are generic or specific within the context of the collection
of documents. One can speculate that terms occurring with high frequency and wide
context, i.e., with frequencies distributed amongst all or nearly all off-diagonal
elements of the matrix are of such broad connotation as to be indifferent
discriminators of content . . . The frequency and context matrices can again be used to
determine the modifiers with which they can most meaningfully be coupled for the
collection of documents being considered." 1/
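The two matrices described in the quotation can be illustrated with a minimal sketch. It is not Baxendale's program: the function names, the toy vocabulary, and the toy documents are hypothetical, and it assumes that "frequency" means the number of documents in which a term (or a pair of terms) appears and that the context matrix holds raw counts of shared co-occurring terms rather than proportions of the vocabulary.

```python
from itertools import combinations


def frequency_matrix(docs, vocab):
    """Diagonal: documents containing a term. Off-diagonal: documents containing both terms."""
    index = {term: i for i, term in enumerate(vocab)}
    n = len(vocab)
    freq = [[0] * n for _ in range(n)]
    for terms in docs:
        # Count each term at most once per document (an assumption of this sketch).
        present = sorted({index[t] for t in terms if t in index})
        for i in present:
            freq[i][i] += 1
        for i, j in combinations(present, 2):
            freq[i][j] += 1
            freq[j][i] += 1
    return freq


def context_matrix(freq):
    """Diagonal: number of distinct terms coordinated with a term.
    Off-diagonal: number of terms that both terms have been coordinated with."""
    n = len(freq)
    neighbours = [
        {k for k in range(n) if k != i and freq[i][k] > 0} for i in range(n)
    ]
    return [[len(neighbours[i] & neighbours[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy data, chosen only to exercise the functions.
vocab = ["machine", "retrieval", "searching", "metal"]
docs = [
    ["machine", "retrieval"],
    ["searching", "retrieval"],
    ["machine", "retrieval", "searching"],
    ["metal"],
]
freq = frequency_matrix(docs, vocab)
context = context_matrix(freq)
```

Under these assumptions, a term whose row in the frequency matrix is non-zero in nearly every off-diagonal position, and whose diagonal entry in the context matrix approaches the vocabulary size, would be a candidate for the "broad connotation" class mentioned in the quotation.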
Finally, Baxendale notes that on the basis of her studies it should be possible to
select quasi-subject headings based on frequency counting criteria, but then to order the
remaining vocabulary of selected terms according to contextual measures of association
which are semantic, syntactic, or statistical in nature. Experimental results for a
collection of 1,500 documents included semantic associations between "searching" and
"retrieval", syntactic associations of "machine" or "literature" with "retrieval", and
the apparently misleading association of "metal" with "retrieval" which, however, had
statistical significance within the particular document sample. 2/
Other investigators who have explored noun-adjective clues for selection include
Anger, Chonez, Langleben and Shumilina, and Swanson. Anger looked for relationships
indicated by syntactic dependencies or by noun-adjective and adjective-adverb linkages,
and gave in an appendix a suggested program for phrase inversions. 3/ Chonez has
described a computer program which by recognizing "separating" words, especially
prepositions, and applying "pseudo-grammatical" rules compiles an index to English
language items in the fields of ionized gas physics and thermonuclear fusion. It is
claimed that:
"The subject index thus prepared is similar in presentation to Luhn's KWIC indexes,
but is fundamentally different in conception and is in fact intermediate between...
(this) ... and the conventional alphabetic subject indexes." 4/
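The general idea of cutting a title at "separating" words can be sketched as follows. This is only an illustration of the principle, not a reconstruction of the program Chonez describes: the separator list, the function names, and the sample title are hypothetical, and the "pseudo-grammatical" rules mentioned above are not modelled at all.

```python
# Hypothetical list of separating words; the actual list used by Chonez is not given here.
SEPARATORS = {"of", "in", "on", "for", "by", "with", "to", "from", "and", "the", "a", "an"}


def split_on_separators(title, separators=SEPARATORS):
    """Cut a title into candidate index phrases at each separating word."""
    phrases, current = [], []
    for raw in title.lower().split():
        word = raw.strip(".,;:()")
        if not word:
            continue
        if word in separators:
            # A separating word closes the phrase collected so far.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


def build_index(titles):
    """Map each candidate phrase to the titles it came from, in alphabetical order."""
    index = {}
    for title in titles:
        for phrase in split_on_separators(title):
            index.setdefault(phrase, []).append(title)
    return dict(sorted(index.items()))


# Hypothetical sample title in the subject area named in the text.
sample = "Heating of ionized gas in thermonuclear fusion devices"
entries = split_on_separators(sample)
```

Unlike a KWIC listing, which enters a title under every significant word, this kind of cut yields multi-word entries such as "ionized gas", which is one way to read the claim that the resulting index lies between KWIC and conventional subject indexes.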
Langleben and Shumilina are concerned with machine-aided procedures for translation
from natural language materials to an intermediary or documentation language.
1/ Ibid, pp. 215-216.
2/ Ibid, pp. 216-217.
3/ Anger, 1961 [15], pp. III-6ff.
4/ Chonez, et al, 1963 [119], p. 31.