MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
for purposes of identifying document contents and to use data on the joint occurrence of
words in the same sentence or similar contexts as grouping criteria. Clark points out in
particular that the use of ordered pairs and longer sequences of words to express a single
concept may be highly characteristic of the special technical language used in a specific
subject field, and notably those of the social sciences. 1/
Others who have explored word n-tuples as selection criteria for automatic extraction
operations include such investigators as Szemere, Levery, and Yakushin. Szemere
reports an investigation of 39 Swedish patent specifications in the field of
switching circuits looking for significant word-pairs, with emphasis on noun-adjective
combinations (1962 [591]). The objectives of a project headed by Levery at IBM-France
have been reported as follows:
"A series of experiments is planned in the fields of automatic indexing of
technical texts and technical vocabulary analysis.
"A statistical method will be tested to determine the degree of closeness in
meaning of words. The method will consist of studying the pairs of words which
appear together in the majority of texts and calculating a coefficient of
correlation from the frequencies. Such work will result in a standard list of notions
frequencies for a particular kind of information.
"Starting from this list, new experiments will be made so as to obtain a list
of keywords representing each text. The method will use statistical comparison
between the distribution of frequencies of notions contained in a text and the
standard distributions obtained for the entire corpus." 2/
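
The first step of the quoted plan, a correlation coefficient computed from the frequencies of words appearing together in texts, can be illustrated with a minimal sketch. The report does not say which coefficient the IBM-France project used; the phi coefficient over presence or absence of each word per text is an assumption made here for illustration, as are the function name `phi_coefficients` and the whitespace tokenization.

```python
from itertools import combinations
from math import sqrt


def phi_coefficients(texts):
    """Return {(w1, w2): phi} for every word pair co-occurring in at least one text.

    Assumption: each text is reduced to the set of its lowercase,
    whitespace-separated words; phi is computed on presence/absence per text.
    """
    n = len(texts)
    word_sets = [set(t.lower().split()) for t in texts]

    doc_freq = {}   # number of texts containing each word
    pair_freq = {}  # number of texts containing both words of a pair
    for words in word_sets:
        for w in words:
            doc_freq[w] = doc_freq.get(w, 0) + 1
        for a, b in combinations(sorted(words), 2):
            pair_freq[(a, b)] = pair_freq.get((a, b), 0) + 1

    result = {}
    for (a, b), n_ab in pair_freq.items():
        n_a, n_b = doc_freq[a], doc_freq[b]
        denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
        if denom == 0:
            continue  # a word present in every text cannot discriminate
        result[(a, b)] = (n * n_ab - n_a * n_b) / denom
    return result


corpus = [
    "relay switch circuit",
    "relay switch",
    "patent claim",
    "patent circuit",
]
scores = phi_coefficients(corpus)
```

In this toy corpus "relay" and "switch" always appear together and so receive the maximum coefficient, whereas "circuit" and "relay" co-occur no more often than chance would predict. Pairs with high coefficients would be candidates for the "standard list" the quotation mentions.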
Yakushin (1963 [654]) develops a variation of the word-pair principle in which he
looks for those pairs where the words are, or suggest, names of objects, such as
"table-leg". He suggests, further, that so-called "basis nouns" can be established for
a given scientific field and entered into an inclusion dictionary, which also contains codes
for the lexical classes to which the word can belong and codes for determining whether or
not the word can join with another as a "basis term". Machine routines are then
suggested to determine whether or not given terms are jointly part of the same text, whether
one textually precedes another in a given text, and whether or not there is a "nomenclator"
pair. Depending upon the frequency of occurrence of identical or semantically related
nomenclator constructions, it is claimed that subject concepts can be detected. That is:
"The method is founded on the finding in a text of so-called basis terms,
established by list, and of the words which explain them. These explanatory
words, which in different contexts refer to one basis term, are grouped and
ordered according to definite rules into a subject concept." 3/
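
The grouping step described in this quotation can be sketched roughly as follows. Yakushin's actual dictionary codes and "definite rules" are not given in this report, so everything specific below is a stand-in assumption: an explanatory word is taken to be the word immediately preceding a basis term, a small stop list is skipped, and ordering is by descending frequency. The names `group_explanatory_words` and `STOP_WORDS` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stop list; the source does not specify one.
STOP_WORDS = {"a", "an", "the"}


def group_explanatory_words(text, basis_terms):
    """Group the words that directly precede each basis term.

    Returns {basis_term: [explanatory words, most frequent first]}.
    Assumption: adjacency stands in for Yakushin's unspecified rules.
    """
    tokens = [t.strip('.,;:()"').lower() for t in text.split()]
    tokens = [t for t in tokens if t]

    groups = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        if cur in basis_terms and prev not in basis_terms and prev not in STOP_WORDS:
            groups[cur][prev] += 1

    # Order by descending count, then alphabetically for a stable result.
    return {
        basis: [w for w, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))]
        for basis, c in groups.items()
    }


sample = (
    "The switching circuit drives a relay circuit. "
    "A switching circuit needs a bistable relay."
)
concepts = group_explanatory_words(sample, {"circuit", "relay"})
```

Here the basis term "circuit" collects "switching" and the basis term "relay" collects "bistable"; the pair "relay circuit" is skipped because both words are basis terms, a case the real method would presumably handle through its codes for joining basis terms.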
1/ Clark, 1960 [123], p. 460.
2/ National Science Foundation's CR&D report no. 11, [430], p. 118.
3/ Yakushin, 1963 [654], p. 16.