NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Problems of Evaluation
Mary Elizabeth Stevens, National Bureau of Standards

"If the answer turns out to be 'no', we might reasonably conclude that the only reliable and effective kind of human indexing is that which is already machine-like in nature." 1/

With a few noteworthy exceptions, there has been very little serious investigation of these problems, and there is very little comparative data. O'Connor has been making a series of studies, with considerable emphasis upon how one might measure the products of machine indexing and how one might derive machine rules for automatic indexing from systematic review of documents indexed by people. Cleverdon and his associates at the ASLIB Cranfield project have extensively tested several different indexing procedures. Painter, MacMillan and Welt, Slamecka and Zunde, and others report findings on intra-indexer and inter-indexer consistency -- unfortunately, on the basis of quite small samples. Various alternate approaches to the evaluation of automatic indexing results have been considered by Borko, Doyle, Swanson, Savage, Giuliano, and others. In addition, some data bearing on these questions have been reported in connection with analyses of selective dissemination (SDI) systems. Some data from other sources, such as studies of user preferences with respect to various reference and search tools, are also pertinent.

The most generally accepted criterion for appraising the effectiveness of indexing is that of retrieval effectiveness. But, in general, this is merely the substitution of one intangible for another, entailing a string of as yet unanswerable or at least unresolved questions. 2/ Retrieval of what, for whom, and when? How can effectiveness be measured except by the elusive question of relevance judgments? How can human judgments of relevance and value be measured and quantified?
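The two quantities discussed above -- inter-indexer consistency and retrieval effectiveness -- can be given concrete, if simplified, form. The sketch below is illustrative only and is not drawn from the monograph: it computes an overlap-style consistency ratio between two indexers' term sets, and recall and precision ratios of the kind used in the Cranfield tests, assuming (contrary to the very difficulty the text raises) that complete relevance judgments are available.

```python
def indexer_consistency(terms_a, terms_b):
    """Overlap consistency between two indexers' term sets for the same
    document: terms assigned in common, divided by all distinct terms
    assigned by either indexer."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def recall_and_precision(retrieved, relevant):
    """Recall: fraction of the relevant documents that were retrieved.
    Precision: fraction of the retrieved documents that are relevant.
    Both presuppose a known, complete set of relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision

# Hypothetical data: two indexers assign terms to one document,
# and a query retrieves four documents from a judged collection.
c = indexer_consistency(["indexing", "evaluation", "retrieval"],
                        ["indexing", "retrieval", "relevance"])
r, p = recall_and_precision(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
```

Note that the recall computation is exactly what Black's remark (footnote 2/) shows to be unattainable in practice: without exhaustive judgments of the whole file, the denominator -- the full set of relevant documents -- is unknown.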
We shall try to distinguish here, insofar as possible, between the core problems that make the evaluation of indexing as such an extremely difficult task, the available data on human indexer reliability, and the possible advantages and disadvantages of automatic indexing techniques.

1/ Montgomery and Swanson, 1962 [421], p. 366.

2/ Compare Swanson, 1960 [582], pp. 2-3: "The performance of retrieval experiments when relevance judgments per se cannot be consistently assessed by human judgment would seem to represent overly vigorous pursuit of a solution before identifying the problem." Similarly, see Black, 1963 [64], p. 14: "Finally, when one is faced with an existing collection of indexed materials, how does one assess the effectiveness of any retrieval system? Suppose that one receives 20 documents as a result of a query to the system. Suppose further that all 20 documents are quite pertinent to the topic of interest. Is there any way to assess the amount of pertinent information still unretrieved from the file? Or is there any way of learning whether the retrieved information is more pertinent than the unretrieved information? The answer is 'No' -- the use of any retrieval system is, then, an act of faith in the quality of indexing."