MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Problems of Evaluation
chapter
Mary Elizabeth Stevens
National Bureau of Standards
but except for the obvious statistical criteria, the problems of how to measure relevancy
remain largely unresolved.
At least some data on the variability of relevance judgments is available in reports
of the performance of an SDI (Selective Dissemination of Information) system. In such
systems, the indexing terms or tags assigned to a new item are compared with a file of
"user-profiles," that is, with a pre-prepared listing of terms or topics in which a
particular user is interested. Where the term-profile of a new item matches that of a
user, a notification of the acquisition of that item is sent to him. Barnes and Resnick
report tests of such a system in which pseudo-notifications selected randomly were
included with those produced from the matching procedure. Account was kept of which
notices were regarded by the users as meeting their interests and which were not. They
found that 58.1 percent of the non-random notifications were regarded as relevant, but
that so also were 26.8 percent of the random ones. 1/
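
The matching-plus-control design described above can be sketched as follows. This is only an illustrative outline, not the procedure Barnes and Resnick actually implemented: the function names, the "at least one shared term" matching rule, and the choice to draw the random pseudo-notifications from the non-matching items are all assumptions made for the example.

```python
import random


def matches(item_terms, profile_terms, min_shared=1):
    """True when the item shares at least min_shared terms with the profile.

    The overlap threshold is a hypothetical rule chosen for illustration.
    """
    return len(set(item_terms) & set(profile_terms)) >= min_shared


def notify(items, profiles, n_random=1, seed=0):
    """Build (user, item_id, source) notices: matched items plus random controls."""
    rng = random.Random(seed)
    notices = []
    for user, profile in profiles.items():
        matched = [i for i, terms in items.items() if matches(terms, profile)]
        # Assumption: control notices are sampled from items that did not match.
        unmatched = [i for i in items if i not in matched]
        controls = rng.sample(unmatched, min(n_random, len(unmatched)))
        notices += [(user, i, "matched") for i in matched]
        notices += [(user, i, "random") for i in controls]
    return notices


def relevance_rate(judgments, source):
    """Percentage of notices from one source that users judged relevant.

    judgments is a list of (source, is_relevant) pairs.
    """
    votes = [relevant for src, relevant in judgments if src == source]
    return 100.0 * sum(votes) / len(votes) if votes else 0.0
```

Comparing `relevance_rate(judgments, "matched")` with `relevance_rate(judgments, "random")` gives the kind of contrast reported above (58.1 percent against 26.8 percent); the random rate serves as a baseline for how often users accept a notice that the matching procedure did not select.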
Katter comments on findings that the intersubjective agreement of typical users
with respect to value judgments of condensed representations of text is low. He
suggests:
"One source of this low intersubjective agreement among users may be that it is
often not clear what is intended by the words relevant and representative.
Considerations such as the validity of the material, its usefulness, stylistic qualities,
understandability, conceptual preferability, etc., can all enter their judgments in
unknown amounts." 2/
Corroborating evidence is available from other sources. Swanson, in his tests of
a natural language text searching technique, had first used subject matter specialists to
rate the relevance of each of the text documents to each of 50 questions. Two individuals
rated each item, and if they disagreed significantly, a third person was asked to reconcile
the difference. In spite of this, 8 percent of the cases of failure to retrieve "relevant"
documents were ascribed to incorrect initial judgments of relevance, and 15 percent of the
presumably "irrelevant" documents were finally judged to be relevant after all (Swanson,
1961 […86]). In Swanson's words: "The question of formulating criteria for judging the
relevance of any document to the motive, purpose, or intent which underlies a request for
information is profound and lies at the heart of the matter." 3/
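
The two-rater procedure with a third person as reconciler can be sketched as below. The source does not state the rating scale, what counted as a "significant" disagreement, or how two agreeing ratings were combined, so the numeric scale, the `max_gap` threshold, and the averaging step are assumptions made purely for illustration.

```python
def adjudicate(rating_a, rating_b, reconcile, max_gap=1):
    """Combine two relevance ratings for one document-question pair.

    reconcile is a callable standing in for the third person; it is consulted
    only when the two ratings differ by more than max_gap (a hypothetical
    threshold for a "significant" disagreement).
    """
    if abs(rating_a - rating_b) > max_gap:
        return reconcile(rating_a, rating_b)
    # Assumption: ratings that roughly agree are simply averaged.
    return (rating_a + rating_b) / 2
```

Even with such a safeguard, the figures reported above show that the reconciled judgments were themselves revised later, which is the point of the passage.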
1/ Barnes and Resnick, 1963 [36], p. 2.
2/ Katter, 1963 [308], p. 24.
3/ Swanson, 1960 [587], p. 1099.