MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Conclusion
chapter
Mary Elizabeth Stevens
National Bureau of Standards
9. CONCLUSION: APPRAISAL OF THE STATE OF THE ART IN AUTOMATIC INDEXING
Notwithstanding the difficulties of evaluation we have discussed, we shall herewith
attempt to evaluate the present state of the art in automatic indexing techniques, using such
available criteria as seem most appropriate. First, we suggest that all of our initial
questions, except possibly the last, can today be answered affirmatively. "Is indexing by
machine possible at all?" To this we can answer an unequivocal "yes" in view of the many
examples of KWIC type indexes extant and in practical use. Secondly, "Is what can be done
by machine properly termed `abstracting', `indexing', or `classifying'?" If, by definition,
word indexing of any kind is not "properly termed... indexing", then, as we have seen,
automatic derivative indexing, such as KWIC, or the selection of words to serve as index
tags based upon the frequencies of their occurrence in text, is not so either.
The fundamental Luhn concept for indexing based on word frequencies is, as we have
seen, straightforward: namely that, after disregarding the most frequent "common words",
especially those that are syntactic-function words -- articles, conjunctions, prepositions,
and the like, together with those words that occur infrequently in a given text, the
remaining high frequency words should give a reasonable indication of what the author was writing
"about". Critiques of the Luhn position have been made on several-fold grounds:
(1) Information-theoretic - that, in fact, the most information is conveyed by
the least frequent words.
(2) Absolute vs. relative frequencies of usage within specialized fields.
(3) Modifications of semantic purport by contextual and syntactic associations.
(4) Problems of synonymity and, conversely, of orthographically identical
words. 1/
(5) Multi-aspect points of interest, and future need of access to material the
author himself did not emphasize.
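As a point of reference for these critiques, the frequency rule summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not Luhn's own procedure: the stop list `COMMON_WORDS`, the function name `luhn_index_terms`, and the cutoff `min_count` are assumptions introduced here, and a practical system would need a far longer list of common words and a threshold tuned to the length of the text.

```python
import re
from collections import Counter

# Hypothetical stop list of "common words" (articles, conjunctions,
# prepositions, and the like). Assumed for illustration only.
COMMON_WORDS = {
    "the", "a", "an", "of", "and", "or", "in", "on", "to", "from",
    "is", "are", "for", "by", "with", "that",
}


def luhn_index_terms(text, min_count=2, common_words=COMMON_WORDS):
    """Return (word, count) pairs for candidate index terms.

    Common words are disregarded, words occurring fewer than
    min_count times are disregarded, and the remainder are listed
    with the most frequent first.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in common_words)
    kept = [(w, n) for w, n in counts.items() if n >= min_count]
    # Descending frequency, then alphabetical order for a stable result.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


sample = (
    "Automatic indexing of documents by machine is possible. "
    "The machine counts the words in the documents, and the "
    "indexing is derived from the counts."
)
print(luhn_index_terms(sample))
```

Note that the sketch treats each spelling as a single term and counts only within the one text, so it is open to exactly the objections listed above: it cannot separate orthographically identical words, merge synonyms, or weigh a word's frequency against its usage in the field at large.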
The last point raises again the criticisms that have been made against derivative,
extractive or "word" indexing of all types. To repeat, although such procedures may
index "as the author himself indexed best -- in his own language", the significant points
are (1) there may be peripheral, minor, or unrecognized aspects of his topic and
incidental information disclosed, of future interest to others, which the author himself is in no
special position to recognize, and (2) notwithstanding the "author's own terminology" being
current usage rather than the "fossilized" vocabulary of any previously established
classification or indexing scheme, this very "currency" changes from field to field and, quite
literally, from day to day. Nevertheless, it should be re-emphasized that the validity of
these criticisms is not limited to automatic derivative indexing as such, but rather is
applicable against any indexing system whatsoever, manual or machine, which is so
strictly limited to author-terminology, author-emphases, and the consideration of the
document at hand as a self-contained entity, without regard to other documents in a
collection, in a particular field, and without respect to specific user needs. By contrast to
this type of limitation, more promising approaches should stress both similarities and
differences between a new document and previously received documents, between
documents "belonging" to some definable category, or not, and even, as responsive to a
particular user's profile-of-interest, or not.
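One way to make the comparison of a new document with previously received documents, or with a user's profile-of-interest, concrete is sketched below. The choice of cosine similarity over raw word counts is an assumption made here for illustration; the text does not prescribe any particular measure, and the helper names `term_counts`, `cosine_similarity`, and `distinctive_terms` are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count the words of a text (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distinctive_terms(new_doc, prior_docs):
    """Words of the new document found in none of the prior documents."""
    seen = set()
    for doc in prior_docs:
        seen.update(doc)
    return sorted(t for t in new_doc if t not in seen)


# Assumed toy data: two previously received documents, one new
# document, and one user's profile-of-interest.
prior = [
    term_counts("machine classification of text"),
    term_counts("manual indexing of documents"),
]
new_doc = term_counts("machine classification of abstracts")
profile = term_counts("machine indexing")

similarities = [cosine_similarity(new_doc, d) for d in prior]
print(similarities)                          # similarity to each prior document
print(distinctive_terms(new_doc, prior))     # what is new in this document
print(cosine_similarity(new_doc, profile))   # match against the user profile
```

The similarity scores express what the new document shares with the collection, while the list of distinctive terms expresses how it differs; both would in practice be computed after common words have been removed, which this sketch omits.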
1/ See Baxendale, 1962 [42], pp. 67-68: "... resolution of orthographic ambiguities
is a non-trivial and over-riding prerequisite for the computer processing of
text...", p. 67.