MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
appendix
Mary Elizabeth Stevens
National Bureau of Standards
Korotkin and Oliver 16/ report that five psychologists and five non-
psychologists indexed 30 items with three descriptors per item. The task
was repeated two weeks later with the aid of an alphabetized list of "suggested"
descriptors derived from the data acquired in the first session.
Mean percent consistency results were as follows:
                                  Session I    Session II
    Group A (Psychologists)         39.0%        53.0%
    Group B (Non-psychologists)     36.4%        54.0%
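The "percent consistency" figures above can be computed in several ways; a common convention (the exact formula used by Korotkin and Oliver is not given here, so this is an assumption) counts the descriptors two indexers assign in common as a fraction of all distinct descriptors either assigned. A minimal sketch:

```python
def percent_consistency(terms_a, terms_b):
    """Inter-indexer consistency for one item, as a percentage.

    Assumed measure (not necessarily Korotkin and Oliver's own):
    descriptors chosen by both indexers, divided by the number of
    distinct descriptors chosen by either.
    """
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:          # neither indexer assigned any descriptor
        return 100.0
    return 100.0 * len(a & b) / len(union)


# Two indexers agree on 2 of 4 distinct descriptors -> 50.0
print(percent_consistency(["testing", "learning", "motivation"],
                          ["testing", "learning", "perception"]))
```

A mean over all 30 items, taken pairwise across indexers, would then yield group figures of the kind tabulated above.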
2. Evaluations of relevancy of selected items to a given search request have
been explored by Badger and Goffman 17/ as follows: "Each of three eval-
uators was asked to dissect the output into relevant and non-relevant
subsets... A chi-square test was applied to the observed evaluation as
compared to those expected if the three evaluators were in complete agree-
ment. The chi-square test of 81.57 was very significant, indicating that there
was an absence of agreement."
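The test quoted above is the standard Pearson chi-square statistic: observed cell counts are compared with the counts expected under complete agreement, and a large value signals departure from agreement. A minimal sketch with hypothetical counts (the actual Badger and Goffman data are not reproduced here):

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical relevant / non-relevant counts for illustration only;
# under complete agreement the evaluators' splits would coincide.
observed = [50, 30]    # one evaluator's split of the output
expected = [40, 40]    # split expected if all evaluators agreed
print(chi_square(observed, expected))   # 5.0
```

The computed statistic is then compared against the chi-square distribution's critical value for the appropriate degrees of freedom; a value such as the 81.57 reported lies far in the tail, hence the conclusion of "an absence of agreement."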
3. Greer 18/ reports on investigations of the interpersonal agreements between
subjects asked to list the search words they would use in posing queries in
the field of information storage and retrieval systems. He found "a mean
percentage consistency agreement of 26.1 among subjects in stating search
words."
4. Hammond 19/ provides a sampling of the use by NASA (National Aeronautics
and Space Administration) and DDC (Defense Documentation Center) of a
common set of indexing terms to index an identical set of 996 technical
reports. In considering 3-term searches against the variant indexing shown
in Hammond's tables, sample calculations show a 25-30 percent failure to
retrieve potentially relevant items.
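Hammond's own calculations are not reproduced here, but the order of magnitude of the failure rate can be illustrated with a rough model (an assumption, not Hammond's method): if the two agencies assign any given term to the same document with probability p, a conjunctive 3-term search succeeds only when all three terms match, so it fails with probability 1 - p^3.

```python
def conjunctive_search_failure(per_term_match, n_terms=3):
    """Failure probability for an AND-search of n_terms terms,
    assuming independent per-term agreement between indexings.

    Illustrative model only; real indexing variation is correlated.
    """
    return 1.0 - per_term_match ** n_terms


# Even 90 percent per-term agreement yields roughly 27 percent
# failure on a 3-term search, consistent in magnitude with the
# 25-30 percent figure cited above.
print(conjunctive_search_failure(0.9))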
5. In terms of intra-indexer consistency, Rodgers 20/ reports that: "A
consistency of .59 in selecting words to be indexed on two different occasions
is not sufficiently high to give us great confidence in expecting a stable store
when human indexers are used."
For these reasons, increasing consideration should be given to the second interpreta-
tion of the term "mechanized indexing", that is, to machine generation of index entries, or
automatic indexing. This typically involves machine processing of some natural language
text, with severe problems of input. The first of several solutions involves use of
automatic character recognition techniques to convert printed text to machine-usable form.
This approach holds considerable future promise, but there are many current limitations
and difficulties.
A second possible solution, manual keyboard operations to produce a machine-useful
transcription of a text, is plagued by high costs (i.e., at least $0.01 per word for unver-
ified keypunching), and also by limitations of available time or manpower.
A third alternative is suggested by current developments in computerized typesetting
or tape-controlled casting or photocomposition machines. However, while such techniques
promise major improvements for the automatic indexing of textual information to be pub-
lished in the future, little can be done for already available literature, even with respect to
the bibliographic citation information alone. Today's difficulties are emphasized by
estimates of a cost of 30 million dollars to convert the present Library of Congress catalog
to machine-readable form 21/.