MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Other Potentially Related Research
chapter
Mary Elizabeth Stevens
National Bureau of Standards
in identical terms and not in synonymous ones. If the existence of synonyms
is avoided, by using a small number of exclusive descriptors, the description
of a document in terms useful for retrieval is more difficult, also it is equally
difficult to relate a request to the description of documents. A further difficulty
is that descriptions only list the main terms, and take no account of their relations
to one another. The C. L. R. U. experiments being carried out make use of a
thesaurus, a procedure through which it is hoped that these difficulties will be
avoided and that a request for a document although not using the same terms as
those in the document will produce that document and others dealing with the
same problem, but described in different, though synonymous, terms." 1/
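The retrieval behavior described in the passage quoted above can be illustrated with a minimal sketch. It is not the C. L. R. U. procedure itself, whose details are not given here; it only assumes that a thesaurus maps each synonymous term to a common head, so that a request and a document description are compared by heads rather than by literal terms. The thesaurus entries, the document descriptions, and the names `THESAURUS`, `to_heads`, and `retrieve` are all hypothetical.

```python
# Hypothetical toy thesaurus: each surface term maps to a common head.
THESAURUS = {
    "aircraft": "AIRCRAFT",
    "airplane": "AIRCRAFT",
    "aeroplane": "AIRCRAFT",
    "indexing": "INDEXING",
    "cataloguing": "INDEXING",
}

# Hypothetical document descriptions (lists of main terms only).
DOCUMENTS = {
    "doc1": ["aeroplane", "cataloguing"],
    "doc2": ["airplane", "design"],
    "doc3": ["ship", "indexing"],
}


def to_heads(terms):
    """Map terms to thesaurus heads; unknown terms stand for themselves."""
    return {THESAURUS.get(t.lower(), t.lower()) for t in terms}


def retrieve(request_terms, documents):
    """Return ids of documents whose heads cover every head in the request."""
    wanted = to_heads(request_terms)
    return sorted(
        doc_id
        for doc_id, terms in documents.items()
        if wanted <= to_heads(terms)
    )


if __name__ == "__main__":
    # The request shares no literal term with doc1, yet doc1 is produced.
    print(retrieve(["aircraft", "indexing"], DOCUMENTS))
```

In this sketch a request for "aircraft" and "indexing" produces the document described as "aeroplane" and "cataloguing", although the two share no term. Like the descriptions criticized in the quotation, it still takes no account of the relations between terms.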
In general, the use of a thesaurus to constrain variations in word or term usage
(as in our first definition, a mechanized authority list), to reduce synonymity, to resolve
homographic ambiguity, to provoke and suggest additional terms or ideas to indexer and
to searcher alike, is related to the improvement of automatic indexing procedures in
precisely the same sense that its use would be effective in any indexing system whatsoever.
In another sense, however, the construction and use of the thesaurus is related
to linguistic data processing by machine. Garvin suggests:
"... One may reasonably expect to arrive at a semantic classification of the
content-bearing elements of a language which is inductively inferred from the study of
text, rather than superimposed from some viewpoint external to the structure of the
language. Such a classification can be expected to yield more reliable answers to
the problems of synonymy and content representation than the existing thesauri
and synonym lists, which are based mainly on intuitively perceived similarities
without adequate empirical controls." 2/
This is with respect to the recognition that the machine itself can be used to compile
and construct the thesaurus. While Luhn in some of his 1957-8 proposals still considered
the compilation and organization of a thesaurus to be primarily a matter of human effort,
he nevertheless pointed out that: "The statistical material that may be required in the
manual compilation of dictionaries and thesauri may be derived from the original texts
in any desired form and degree of detail." 3/ De Grolier makes the complementary
statement that the Luhn techniques should "considerably facilitate" the preparation of
thesauri. 4/
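As a rough illustration of the kind of "statistical material" that can be derived from original texts, the following sketch counts how often each word occurs overall and in how many texts it appears. It is not a reconstruction of Luhn's programs; the tokenization rule (runs of ASCII letters, lowercased) and the function name `word_statistics` are assumptions made for the example.

```python
import re
from collections import Counter


def word_statistics(texts):
    """Return (total occurrences per word, number of texts containing each word)."""
    total = Counter()
    doc_freq = Counter()
    for text in texts:
        # Assumed tokenization: lowercase runs of ASCII letters.
        words = re.findall(r"[a-z]+", text.lower())
        total.update(words)
        doc_freq.update(set(words))  # count each word once per text
    return total, doc_freq


if __name__ == "__main__":
    sample = ["Index terms index documents.", "Documents need terms."]
    total, doc_freq = word_statistics(sample)
    for word, count in total.most_common():
        print(word, count, doc_freq[word])
```

Counts of this sort could be handed to a human compiler as raw material, which is the division of labor the passage attributes to Luhn's 1957-8 proposals.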
Even more importantly, the computer can be used for periodic up-datings and
revisions. The work on the FASEB index-term normalization procedures involved early
recognition of the need to "educate the thesaurus" by examining print-outs when no
matches occurred and providing a continuous process of amendment. 5/
Computer-maintained statistics of word and term usages are closely related to possibilities for
1/ Masterman, Needham, and Sparck-Jones, 1958 [405], p. 934-935; Needham and Joyce, 1958 [305].
2/ Garvin, 1961 [224], p. 138.
3/ Luhn, 1959 [354], p. 12.
4/ De Grolier, 1962 [152], p. 132.
5/ Shepherd, 1963 [545], p. 392.