MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Other Potentially Related Research
chapter
Mary Elizabeth Stevens
National Bureau of Standards
"Automatic indexing, based on the relative frequency of words used in a document,
produces a partial vocabulary of the content words used to express its subject.
Retrieval can then be accomplished by expanding the request vocabulary... This
method tends to overcome the deficiencies and inconsistencies inherent in the use
of terms derived automatically from a text." 1/
Conversely, Stiles also points out that automatic derivative indexing procedures, which
extract indexing words directly from the documents, might provide a more realistic or
reliable basis for developing his word co-occurrence correlation data than the Uniterms
assigned by human indexers. 2/ His work has also stressed two factors that may well be
critical for the improvement of automatic indexing techniques: the consensus of prior
human indexing and the consensus of subject coverage of a particular collection. 3/
In his experimental investigations, Stiles began with an existing collection of approximately
100,000 items which had previously been indexed, over a period of time, with a
Uniterm indexing vocabulary consisting of about 15,000 terms. The objective of the
experiments was to determine how, given a specific search request, a more effective "net
to catch documents" 4/ could be generated and how the responding items might be ranked
in order of their probable relevance to the request.
The statistics of co-occurrence of terms used to index the same documents were first
obtained. A modified chi-square formula was then applied to determine relative frequencies
of use of co-occurring terms. 5/ Patterns of term co-occurrence could then be
derived in the sense of term-profiles which show, for each term, the more significant of
its associational values of pairing with other terms in the collection. The actual procedure
for using these term-profiles in search prescription formulation and in document selection
involves several steps, generally as follows: 6/
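As an aside on the preceding paragraph, and not as one of the steps themselves, the derivation of term profiles from co-occurrence statistics can be illustrated with a minimal sketch. It assumes a standard Yates-corrected chi-square for a 2x2 table as a stand-in for the "modified chi-square formula", whose exact form is not given here. The function names, the `threshold` and `top_k` parameters, and the sample index are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(index):
    """Count term and term-pair frequencies over an indexed collection.

    index: dict mapping a document id to the set of terms assigned to it.
    Returns (term_freq, pair_freq, n_docs); pair keys are sorted 2-tuples.
    """
    term_freq = Counter()
    pair_freq = Counter()
    for terms in index.values():
        unique = sorted(set(terms))
        term_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return term_freq, pair_freq, len(index)


def yates_chi_square(a, b, f, n):
    """Yates-corrected chi-square for two terms (an assumed formula).

    a, b: number of documents indexed by each term
    f:    number of documents indexed by both terms
    n:    number of documents in the collection
    Returns None when the pair co-occurs no more often than chance
    predicts, or when the statistic is undefined.
    """
    denom = a * b * (n - a) * (n - b)
    if denom == 0 or f * n <= a * b:
        return None
    deviation = abs(f * n - a * b) - n / 2
    if deviation <= 0:
        return None
    return deviation * deviation * n / denom


def term_profiles(index, threshold=0.0, top_k=10):
    """For each term, list its most strongly associated terms with scores."""
    term_freq, pair_freq, n = cooccurrence_counts(index)
    profiles = {term: [] for term in term_freq}
    for (x, y), f in pair_freq.items():
        score = yates_chi_square(term_freq[x], term_freq[y], f, n)
        if score is not None and score >= threshold:
            profiles[x].append((y, score))
            profiles[y].append((x, score))
    for term in profiles:
        # Strongest associations first; ties broken alphabetically.
        profiles[term].sort(key=lambda pair: (-pair[1], pair[0]))
        profiles[term] = profiles[term][:top_k]
    return profiles


# Hypothetical eight-document collection, for illustration only.
sample_index = {
    1: {"a", "b"}, 2: {"a", "b"}, 3: {"a", "b"}, 4: {"c"},
    5: {"c", "d"}, 6: {"d"}, 7: {"c"}, 8: {"d"},
}
sample_profiles = term_profiles(sample_index)
```

In this sample, terms "a" and "b" always appear together and so enter each other's profile, while "c" and "d" co-occur no more often than chance would predict and are left unassociated.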
1/ Stiles, 1962 [573], pp. 12-13.
2/ Stiles, 1961 [572], p. 205.
3/ Stiles, 1962 [573], p. 6 and 1961 [572], pp. 273, 277.
4/ Stiles, 1961 [572], p. 192.
5/ In general, we shall not be concerned with the precise mathematical formulations.
It is to be noted that in a recent report Giuliano and his colleagues have reviewed
a number of the various mathematical formulas proposed in the literature for the
computation of word, term, and document associations, including those of Parker-
Rhodes and Needham, Maron and Kuhns, Stiles, Salton, Osgood, Bennett and
Spiegel (Giuliano et al., 1963 [230], Appendix I).
6/ Stiles, 1961 [571], pp. 273-275.