MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Operational Considerations
chapter
Mary Elizabeth Stevens
National Bureau of Standards
The question of stop list effectiveness therefore becomes an operational factor as well as
one that may affect the quality and acceptability of the product. On the other hand, too
generous a purging of the input titles may of course reduce the utility of the title index by
the elimination of too many potential access points and, in particular, many that users
may be most tempted to look for.
A related problem has to do with the number of pages required because of the length
of the title line allowed in the listings. A suggestion advanced by Brandenberg (1963 [80])
is the assignment of numeric codes to the machine stop words used and the insertion of
these codes into the listed title line in the place of these presumably insignificant words.
Thus one of the KWIC entries for the title, "Determining Aspects of the Russian Verb
from Context in Machine Translation" might go from:
    CONTEXT IN MACHINE TRANSLATION. /DETERMINING ASPECT OF THE
to:
    CONTEXT 308 MACHINE TRANSLATION. /DETERMINING 032 416 712 RUS
This particular example was picked at random from a KWIC index utilizing a 103-106
character title line, 1/ but it was deliberately shortened to the 60-character line length
found in many such indexes in order to illustrate effects of chopping and wrap-around.
Coincidentally, it also illustrates some of the difficulties of designing a well-balanced
exclusion list since in this case the purged word "aspect" is apparently being used in a
technical sense rather than in the common one of "Various aspects of...". By accident,
this case does show rather severe "aspects" of the chopping problem in the loss also, for
this entry, of "Russian" and "verb" although they would of course be picked up in the entry
blocks for these words. Certainly, however, the claimed advantages of context checking
are not striking, even without the introduction of the numeric codes. It is true that for
excluded words longer than those in our example, the character space saved might reduce
the chopping effects for the same line length and so result in improvements. However,
the replacement of, for example, "Preliminary investigations of..." by numeric codes
would hardly assist the user in determining quickly, from the many possible entries
under "...", which he should select for further personal perusal.
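
The effect of Brandenberg's proposal on a chopped title line can be tried out with a minimal Python sketch. It is an illustration only, not the procedure of any actual KWIC program: the function names, the simple rotate-and-truncate layout, and the three-entry code table are assumptions, with the codes for IN, OF and THE taken from the example above.

```python
# Illustrative, deliberately partial stop-word code table (an assumption).
STOP_CODES = {"IN": "308", "OF": "416", "THE": "712"}


def encode_stop_words(title, codes=STOP_CODES):
    """Replace each coded stop word in the title by its numeric code."""
    words = title.upper().split()
    return " ".join(codes.get(word, word) for word in words)


def kwic_entry(title, keyword, width=60):
    """Build one wrapped-around KWIC line, chopped to `width` characters.

    The line starts at the keyword, runs to the end of the title, and
    then wraps around to the start of the title after a ' /' marker.
    """
    words = title.upper().split()
    start = words.index(keyword.upper())  # raises ValueError if absent
    rotated = " ".join(words[start:]) + ". /" + " ".join(words[:start])
    return rotated[:width].rstrip()


title = ("Determining Aspects of the Russian Verb "
         "from Context in Machine Translation")
coded = encode_stop_words(title)

print(kwic_entry(title, "context"))
print(kwic_entry(coded, "context"))
print(len(title), len(coded))
```

With these assumed codes the coded title is two characters longer than the plain one, since "OF" and "IN" are shorter than their three-digit codes. This is consistent with the observation that savings in character space can be expected only for longer excluded words.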
Turning to the case of automatic assignment indexing, the processing considerations
likely to be involved in operational factors affecting the evaluation of a system are much
less easily exemplified. Obviously, conditions that hold for research experiments on
small (and usually, especially selected) samples do not necessarily relate to requirements
in potential productive applications. Exceptions are the problems of the sizes of
term-term and term-document co-occurrence correlation matrices that can be readily
manipulated, previously mentioned, 2/ and the concurrent problems of the size, and hence the
representativeness, of inclusion lists or clue-word vocabularies that can be accommodated.
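
The dependence of assignment on the clue-word vocabulary can be illustrated with a small Python sketch. It assumes, purely for illustration, that an item can be assigned only if it contains at least one word on the inclusion list. The function names, the tokenization, and the sample data are hypothetical and are not drawn from any of the systems discussed.

```python
def is_assignable(item_text, clue_words):
    """Return True if the item contains at least one clue word.

    `clue_words` is assumed to be a set of lower-case words.
    """
    tokens = {word.strip('.,;:()"').lower() for word in item_text.split()}
    return bool(tokens & clue_words)


def unassignable_fraction(items, clue_words):
    """Fraction of items with no clue word at all (0.0 for an empty list)."""
    if not items:
        return 0.0
    misses = [text for text in items if not is_assignable(text, clue_words)]
    return len(misses) / len(items)


# Hypothetical inclusion list and test items.
clue_words = {"translation", "indexing"}
items = [
    "Machine translation of Russian",
    "Automatic indexing methods",
    "Crystal growth in alloys",
    "Stop lists",
]
print(unassignable_fraction(items, clue_words))
```

A larger or more representative inclusion list lowers this fraction, at the cost of a larger vocabulary to be stored and matched.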
Both Maron and Borko found, even in their limited test samples, a certain proportion
of new items that could not be indexed or categorized at all because these new items did
not contain any of the clue words recognizable by the system. 3/ Due perhaps to longer
selective clue word lists, as well as to the special nature of his items, Swanson found no
instances, for 775 test items, of failure to assign because of lack of indicative clues in the
input material. In the case of 60 tests against the SADSACT model, which uses
approximately 1,600 words drawn from a "teaching sample" of items previously indexed to
descriptors (related by frequency of co-occurrence to any of 70-odd descriptors with whose
1/ Walkowicz, 1963 [629], pp. 136 and 137.
2/ See pp. 108 and 160 of this report.
3/ See Maron, 1961 [395]; also Borko and Bernick, 1963 [78].