MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
3.4 Quality of Modified Derivative Indexing by Machine
Most of the modified derivative indexing techniques that have been proposed to date
have few or no indexing results to provide comparative data for purposes of evaluation.
Moreover, those techniques which are primarily directed to the generation of document
abstracts rather than indexing terms have been reported to date with a paucity of actual
examples. 1/ One of the main reasons for this lack of product-effectiveness data is
unquestionably the high cost and difficulty of obtaining substantial corpora of representative
document text in machine-readable form. For the most part, the few examples of
automatic abstracts produced by machine are sadly lacking in pertinency, relevancy, 2/
and in continuity for scanning or reading by comparison with conventional human abstracts,
whether prepared by author, editor, volunteer specialist in the subject field, or
professional documentalist.
A few studies have been made for a somewhat larger number of examples of
"auto-abstracts" with respect to differences between several different machine-extraction
formulas, random sentence selections, and sentences extracted manually. A project
conducted by IBM's Advanced Systems Development Division for the ACSI-matic program
(1960 [289], 1961 [290]), involved 70 to 90 articles on military intelligence items. The
comparisons were of "auto-abstracts" as against titles, full texts,
"pseudo-auto-abstracts" comprised of the first and last 5 percent of the sentences of each text, and
sets of sentences selected randomly, without reference to conventional types of manually
prepared abstracts and without respect to the quality as such. Similarly, Thompson
Ramo Wooldridge data (1963 [601]) on machine-extracted and randomly-extracted
sentence sets compare these "abstracts" against manual selection of 25 percent of the
sentences of each item, rather than against a conventional type of abstract.
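
The two non-statistical baselines mentioned above, the "pseudo-auto-abstract" built from the first and last 5 percent of the sentences and the randomly selected sentence set, are simple enough to sketch. The Python below is an illustration only and is not the IBM or Thompson Ramo Wooldridge procedure: the function names, the round-up rule, the choice to keep random selections in document order, and the assumption that the text has already been split into sentences are all hypothetical.

```python
import random


def pseudo_auto_abstract(sentences, percent=5):
    """Return the first and last `percent` percent of the sentences, in order."""
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: round up, so at least one sentence is taken from each end.
    k = max(1, (n * percent + 99) // 100)
    if 2 * k >= n:
        # Very short text: the two ends overlap, so return everything once.
        return list(sentences)
    return list(sentences[:k]) + list(sentences[-k:])


def random_selection(sentences, count, seed=None):
    """Return `count` randomly chosen sentences, kept in document order."""
    rng = random.Random(seed)
    count = min(count, len(sentences))
    chosen = sorted(rng.sample(range(len(sentences)), count))
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    text = [f"Sentence {i}." for i in range(40)]
    print(pseudo_auto_abstract(text))
    print(random_selection(text, 4, seed=1))
```

Drawing the random set at the same length as the machine-extracted set keeps the comparison from being dominated by differences in length alone.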
There are, however, almost no data available on the possible results of using sentence
and word-group extracting techniques, applied to machine-usable texts, to the develop-
ment of indexing entries rather than to the generation of substitutes for document
abstracts. For this reason, as well as because discussion of the difficulties of evaluation
in general will be deferred to a later section of this report, the question of the quality of
modified derivative indexing will be briefly considered below, largely in terms of
non-quantitative judgments.
First and foremost, as has been noted previously, is the objection that word-indexing
typically produces redundancy, scatter of references among synonyms and near-synonyms,
inclusion of many irrelevant entries at high page and user-scanning costs, omission of
1/
Purto expresses regret that the studies of Agrayev and Borodin, intercomparing
results of human abstracting, use of Luhn's method, and their own modification,
used only a single paper (1962 [484]). Storm (1961 [577]), evaluating the initial
noun occurrence technique as a measure of sentence and index-term extraction
significance, reports results for only two papers, both by Quine. Only nine
articles, with no more than 40,000 words of text in toto, were used by Edmundson,
Oswald and Wyllys in their 1960 experiments ([180]).
2/
Compare, for example, Lesk and Storm, 1961 [358], pp. 1-29 and 1-30 as follows:
"A final problem is the ambiguity that may arise by removing two sentences from
context; two sentences alone do not always permit comprehension. Worse yet, the
meaning may actually be inverted upon removal from context. For example. . . a
quote is selected which an unsuspecting reader might think the author supports,
when he is really attacking the position."