IR4873
NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
1. Introduction
Vast amounts of text are available online today, including text created for electronic access and text designed
mainly for traditional publishing. This text is not searchable without the ability to do automatic indexing. Yet
the "discovery" that adequate indexing could be done using single terms from the text generally surprised the
library community. As Cyril Cleverdon reported from the Cranfield project (Cleverdon & Keen 1966):
"Quite the most astonishing and seemingly inexplicable conclusion that arises from the project is that the
single term indexing languages are superior to any other type...unless one is prepared to say that the whole test
conception is so much at fault that the results are completely distorted, there is no course except to attempt to
explain the results which seem to offend against every canon on which we were trained as librarians."
Today we not only accept these results, but base many of the large commercial online systems on this once-
revolutionary idea. The discovery of automatic indexing coincided with the availability of large computers and
created a major interest in automatically indexing and searching text, such as the work done by H.P. Luhn (1957)
in investigating the use of frequency weights in automatic indexing. Work has continued since then in various
research laboratories and has resulted in more sophisticated automatic indexing methods, using single terms and
using larger chunks of text (such as phrases).
This paper was written to serve two separate goals. The first goal is to provide a tutorial on single term
indexing of "real-world" text. Therefore, section 2 steps through the indexing process, discussing the types of
critical issues that must be resolved during full text indexing in order to provide effective retrieval performance.
Most of these issues are straightforward. However, poor choices of indexing parameters produce systems that
would be considered failures in most applications.
The second goal is to provide some discussion of advances in automatic indexing beyond the simple single-
term indexing done in most operational retrieval systems. Section 3 discusses many of the techniques being
investigated and provides references for further reading.
2. Automatically producing simple index terms
This section presents a walk-through of the processing of an online text file to produce a list of index terms
that can be used for searching that file. These terms would be placed in an inverted file, or other data structure,
and an information search could be made against this index using Boolean retrieval operators to combine the
terms. Alternatively, some of the more advanced searching methods could use these terms as input to term
weighting algorithms that produce ranked output using statistical techniques.
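As an illustration of the inverted file and Boolean operators mentioned above, the following is a minimal sketch
in Python, not the method of any particular system. The function names (tokenize, build_inverted_file,
boolean_and, boolean_or), the three sample records, and the simple rule that a term is any run of letters or
digits are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into single terms (runs of letters or digits).

    This tokenization rule is an assumption for the sketch, not a recommendation.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_file(records):
    """Map each term to the set of record ids whose text contains that term."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index


def boolean_and(index, *terms):
    """Return the ids of records that contain every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Return the ids of records that contain at least one of the given terms."""
    return set().union(*(index.get(term, set()) for term in terms))


# Hypothetical sample records, keyed by record id.
records = {
    1: "Automatic indexing of online text",
    2: "Manual indexing by trained librarians",
    3: "Searching online catalogs",
}

index = build_inverted_file(records)
both = boolean_and(index, "indexing", "online")    # records containing both terms
either = boolean_or(index, "indexing", "online")   # records containing either term
print(sorted(both), sorted(either))
```

Each term's set of record ids plays the role of a postings list, so AND and OR reduce to set intersection and
union. A ranking system would instead store term frequencies with each posting and feed them to a weighting
function.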
2.1 What constitutes a record
The first key decision for any indexing is the choice of record boundaries which identify a searchable unit. A
record could be defined as an entire book, a chapter in the book, a section in that chapter, or even a paragraph.
This decision is critical for effective retrieval, both in the retrieval/display stage and in the search stage. Often
this decision is clear-cut. For example, if the application is searching bibliographic records as in an online
catalog, clearly a record is one of the bibliographic records. Similarly, if the application is searching newspaper
articles or newswire stories for particular events, then these articles or stories each become a record. The choice
of record size becomes fuzzy, however, as the size of the documents being examined grows larger. If the docu-
ments being searched are long articles such as legal transcriptions of court cases or full journal articles, then the
record might still be the entire document, although this may make display and searching more difficult. How-
ever, if the documents being searched are manuals or textbooks, a record should not be the entire document.
Here the choice should depend on the retrieval and display mechanisms of the particular application. For