NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Mary Elizabeth Stevens, National Bureau of Standards

9. CONCLUSION: APPRAISAL OF THE STATE OF THE ART IN AUTOMATIC INDEXING

Notwithstanding the difficulties of evaluation we have discussed, we shall herewith attempt to evaluate the present state of the art in automatic indexing techniques, using such available criteria as seem most appropriate. First, we suggest that all of our initial questions, except possibly the last, can today be answered affirmatively.

"Is indexing by machine possible at all?" To this we can answer an unequivocal "yes" in view of the many examples of KWIC type indexes extant and in practical use.

Secondly, "Is what can be done by machine properly termed 'abstracting', 'indexing', or 'classifying'?" If, by definition, word indexing of any kind is not "properly termed... indexing", then, as we have seen, automatic derivative indexing, such as KWIC, or the selection of words to serve as index tags based upon the frequencies of their occurrence in text, is not so either.

The fundamental Luhn concept for indexing based on word frequencies is, as we have seen, straightforward: namely that, after disregarding the most frequent "common words", especially those that are syntactic-function words -- articles, conjunctions, prepositions, and the like -- together with those words that occur infrequently in a given text, the remaining high frequency words should give a reasonable indication of what the author was writing "about". Critiques of the Luhn position have been made on several grounds:

(1) Information-theoretic - that, in fact, the most information is conveyed by the least frequent words.
(2) Absolute vs. relative frequencies of usage within specialized fields.
(3) Modifications of semantic purport by contextual and syntactic associations.
(4) Problems of synonymity and, conversely, of orthographically identical words. 1/
(5) Multi-aspect points of interest, and future need of access to material the author himself did not emphasize.

The last point raises again the criticisms that have been made against derivative, extractive or "word" indexing of all types. To repeat, although such procedures may index "as the author himself indexed best -- in his own language", the significant points are (1) there may be peripheral, minor, or unrecognized aspects of his topic and incidental information disclosed, of future interest to others, which the author himself is in no special position to recognize, and (2) notwithstanding the "author's own terminology" being current usage rather than the "fossilized" vocabulary of any previously established classification or indexing scheme, this very "currency" changes from field to field and, quite literally, from day to day.

Nevertheless, it should be re-emphasized that the validity of these criticisms is not limited to automatic derivative indexing as such, but rather is applicable against any indexing system whatsoever, manual or machine, which is so strictly limited to author-terminology, author-emphases, and the consideration of the document at hand as a self-contained entity, without regard to other documents in a collection, in a particular field, and without respect to specific user needs. By contrast to this type of limitation, more promising approaches should stress both similarities and differences between a new document and previously received documents, between documents "belonging" to some definable category, or not, and even, as responsive to a particular user's profile-of-interest, or not.

1/ See Baxendale, 1962 [42], pp. 67-68: "... resolution of orthographic ambiguities is a non-trivial and over-riding prerequisite for the computer processing of text...", p. 67.
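
As an illustration only, the Luhn word-frequency concept discussed in this chapter (discard the "common words", discard the words that occur infrequently in the given text, and keep the remaining high frequency words as candidate index tags) might be sketched as follows. This is a minimal sketch in modern Python, not a reconstruction of any program cited in the report. The stop list COMMON_WORDS, the function name luhn_index_terms, and the thresholds min_count and max_terms are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of "common" syntactic-function words.
# A practical list would be much longer and tuned to the collection.
COMMON_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "for",
    "with", "by", "is", "are", "was", "that", "this", "it", "as", "at", "from",
}


def luhn_index_terms(text, min_count=2, max_terms=5):
    """Return candidate index terms chosen purely by word frequency.

    min_count and max_terms are assumed cut-offs for this sketch only.
    """
    # Crude tokenization: lowercase runs of letters (so "author's" splits in two).
    words = re.findall(r"[a-z]+", text.lower())
    # Step 1: disregard the common function words.
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    # Step 2: disregard words that occur infrequently in this text.
    frequent = [(w, c) for w, c in counts.items() if c >= min_count]
    # Step 3: rank what remains by frequency, breaking ties alphabetically.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in frequent[:max_terms]]


sample = ("The index of the document lists index terms. "
          "Terms in an index point to the document.")
print(luhn_index_terms(sample))
```

Because the selection depends only on the words of the single document at hand, the sketch inherits the limitations listed above: it cannot tell synonyms from distinct terms, it ignores context and syntax, and it takes no account of other documents in the collection or of a particular user's interests.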