MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
appendix
Mary Elizabeth Stevens
National Bureau of Standards
Korotkin and Oliver 16/ report that five psychologists and five non-
psychologists indexed 30 items with three descriptors per item. The task
was repeated two weeks later with the aid of an alphabetized list of "suggested"
descriptors derived from the data acquired in the first session.
Mean percent consistency results were as follows:
                                  Session I    Session II
    Group A (Psychologists)         39.0%        53.0%
    Group B (Non-psychologists)     36.4%        54.0%
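The "percent consistency" figures above can be computed in several ways; a common convention (the exact formula used by Korotkin and Oliver is not given here, so this is an assumption) counts the descriptors two indexers assign in common as a fraction of all distinct descriptors either assigned. A minimal sketch:

```python
def percent_consistency(terms_a, terms_b):
    """Inter-indexer consistency for one item, as a percentage.

    Assumed measure (not necessarily Korotkin and Oliver's own):
    descriptors chosen by both indexers, divided by the number of
    distinct descriptors chosen by either.
    """
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:          # neither indexer assigned any descriptor
        return 100.0
    return 100.0 * len(a & b) / len(union)


# Two indexers agree on 2 of 4 distinct descriptors -> 50.0
print(percent_consistency(["testing", "learning", "motivation"],
                          ["testing", "learning", "perception"]))
```

A mean over all 30 items, taken pairwise across indexers, would then yield group figures of the kind tabulated above.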
2. Evaluations of relevancy of selected items to a given search request have
been explored by Badger and Goffman 17/ as follows: "Each of three eval-
uators was asked to dissect the output into relevant and non-relevant
subsets... A chi-square test was applied to the observed evaluation as
compared to those expected if the three evaluators were in complete agree-
ment. The chi-square test of 81.57 was very significant, indicating that there
was an absence of agreement."
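The test quoted above is the standard Pearson chi-square statistic: observed cell counts are compared with the counts expected under complete agreement, and a large value signals departure from agreement. A minimal sketch with hypothetical counts (the actual Badger and Goffman data are not reproduced here):

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical relevant / non-relevant counts for illustration only;
# under complete agreement the evaluators' splits would coincide.
observed = [50, 30]    # one evaluator's split of the output
expected = [40, 40]    # split expected if all evaluators agreed
print(chi_square(observed, expected))   # 5.0
```

The computed statistic is then compared against the chi-square distribution's critical value for the appropriate degrees of freedom; a value such as the 81.57 reported lies far in the tail, hence the conclusion of "an absence of agreement."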
3. Greer 18/ reports on investigations of the interpersonal agreements between
subjects asked to list the search words they would use in posing queries in
the field of information storage and retrieval systems. He found "a mean
percentage consistency agreement of 26.1 among subjects in stating search
words."
4. Hammond 19/ provides a sampling of the use by NASA (National Aeronautics
and Space Administration) and DDC (Defense Documentation Center) of a
common set of indexing terms to index an identical set of 996 technical
reports. In considering 3-term searches against the variant indexing shown
in Hammond's tables, sample calculations show a 25-30 percent failure to
retrieve potentially relevant items.
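Hammond's own calculations are not reproduced here, but the order of magnitude of the failure rate can be illustrated with a rough model (an assumption, not Hammond's method): if the two agencies assign any given term to the same document with probability p, a conjunctive 3-term search succeeds only when all three terms match, so it fails with probability 1 - p^3.

```python
def conjunctive_search_failure(per_term_match, n_terms=3):
    """Failure probability for an AND-search of n_terms terms,
    assuming independent per-term agreement between indexings.

    Illustrative model only; real indexing variation is correlated.
    """
    return 1.0 - per_term_match ** n_terms


# Even 90 percent per-term agreement yields roughly 27 percent
# failure on a 3-term search, consistent in magnitude with the
# 25-30 percent figure cited above.
print(conjunctive_search_failure(0.9))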
5. In terms of intra-indexer consistency, Rodgers 20/ reports that: "A
consistency of .59 in selecting words to be indexed on two different occasions
is not sufficiently high to give us great confidence in expecting a stable store
when human indexers are used."
For these reasons, increasing consideration should be given to the second interpreta-
tion of the term "mechanized indexing", that is, to machine generation of index entries, or
automatic indexing. This typically involves machine processing of some natural language
text, with severe problems of input. The first of several solutions involves use of
automatic character recognition techniques to convert printed text to machine-usable form.
This approach holds considerable future promise, but there are many current limitations
and difficulties.
A second possible solution, manual keyboard operations to produce a machine-useful
transcription of a text, is plagued by high costs (i.e., at least $0.01 per word for unver-
ified keypunching), and also by limitations of available time or manpower.
A third alternative is suggested by current developments in computerized typesetting
or tape-controlled casting or photocomposition machines. However, while such techniques
promise major improvements for the automatic indexing of textual information to be pub-
lished in the future, little can be done for already available literature, even with respect to
the bibliographic citation information alone. Today's difficulties are emphasized by
estimates of a cost of 30 million dollars to convert the present Library of Congress catalog
to machine-readable form 21/.