NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
* Commas - if numbers are indexed, commas and decimal points become important.
Once word boundaries are defined, an equally difficult issue is what words or tokens to index. This particularly applies to the indexing of numbers. If numbers are indexed, the number of unique words can explode because there is an unlimited set of unique numbers. As an example, when all numbers in the 50-megabyte text collection shown in Table 1 were indexed, the number of unique terms went from 10,122 (indexing no numbers) to 55,486 (indexing all numbers). (The number of unique terms shown in Table 1 includes indexing some numbers, as explained later.) Indexing all numbers would have caused an almost doubling of the index size, and therefore slower response times. However, not indexing the numbers can lead to major searching problems when a number is critical to the query (such as "what were the major breakthroughs in computer speed in 1986").
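The trade-off can be sketched in a few lines of Python (a minimal illustration, not from the report; the tokenizer, regular expression, and sample text are assumptions):

```python
import re

def tokenize(text, index_numbers=True):
    """Split text into lowercase alphanumeric tokens; optionally drop pure numbers."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not index_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

query = "Major breakthroughs in computer speed in 1986"
print(sorted(set(tokenize(query, index_numbers=True))))
print(sorted(set(tokenize(query, index_numbers=False))))
```

With `index_numbers=False` the vocabulary stays small, but the token "1986" vanishes and the query above can no longer be matched on its most discriminating term.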
The same problem can apply to the indexing of single characters (other than the words "a" or "I", which are discussed in the next section as stopwords). Whereas the number of unique single characters is limited, the heavy use of single characters as initials, section labels, etc. can increase the size of the index. Again, however, not indexing single characters can lead to searching problems for queries in which these characters are critical (such as "sources of vitamin C").
The solutions to both the problem of word boundaries and the problem of what words to index involve compromises. Before indexing is started, samples of the text to be indexed, and samples of the types of queries to be run, need to be closely examined. This may require a prototype/user testing operation, or may be solved by simply discussing the problem with the users. The following examples illustrate some of the possible compromises.
* The punctuation in the text should be studied, and potential problems identified, so that reasonable rules of word separation can be found. Often hyphenated words are treated both as separated words and as hyphenated words. Other types of punctuation are handled differently based on preceding or succeeding characters or spaces.
* The use of upper- and lower-case letters also needs to be determined. Usually upper-case letters are changed to lower case during indexing, as capitalized words at sentence beginnings will not correctly match lower-case query words. However, if proper nouns are to be treated as special terms, then upper-case letters are necessary for proper noun recognition.
* The indexing of numbers is also heavily application dependent. Dates, section labels, and numbers combined with alphabetics may be indexed, and other numbers not indexed. If hyphens can be kept, then some number problems are eliminated (such as F-16). In the 50-megabyte text collection shown in Table 1, numbers that were part of section labels were kept, and these were distinguished by the punctuation that appeared in the number. Some searches were still unsuccessful, however, because of the lack of complete number indexing.
* The indexing of single characters is somewhat easier to handle. Users can check the alphabet and note any
letters that have particular meaning in their application, and these letters can be indexed.
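Compromise rules of this kind can be combined in one tokenizer. The sketch below is illustrative only (the rule set, regular expression, and whitelist are assumptions, not the report's method): hyphenated and punctuated tokens are kept whole, text is lower-cased, a bare number is dropped unless it contains internal punctuation (a section label such as "2.3") or is attached to letters ("f-16"), and a single letter is indexed only if it appears on a user-supplied whitelist.

```python
import re

# Match runs of alphanumerics, optionally joined by hyphens or periods,
# so tokens like "f-16" and "2.3" survive as single terms.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-.][a-z0-9]+)*")

def index_terms(text, single_char_whitelist=frozenset({"c"})):
    """Return the index terms for text under the compromise rules above."""
    terms = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok.isdigit():
            continue  # bare number: not indexed
        if len(tok) == 1 and tok not in single_char_whitelist:
            continue  # single character with no special meaning here
        terms.append(tok)
    return terms

print(index_terms("Section 2.3 covers the F-16 and vitamin C in 1986."))
```

Here "2.3", "f-16", and the whitelisted "c" are indexed, while the bare number "1986" is dropped; the same routine with a different whitelist or number rule would implement a different compromise.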
Most commercial systems take a conservative approach to these problems. For example, Chemical Abstracts
Service, ORBIT Search Service, and Mead Data Central's LEXIS/NEXIS systems all recognize numbers and
words containing digits as index terms, and all are case insensitive. In general they have no special provisions
for punctuation marks, although Chemical Abstracts Service keeps hyphenated words as single tokens, and the
other two systems break hyphenated words apart (Fox 1992).
2. Use of stop lists
Additionally, most automatic indexing techniques work with a stop list that prevents certain high-frequency or "fluff" words from being indexed. Francis & Kucera (1982) found that the ten most frequently used words in the English language typically account for twenty to thirty percent of the terms in a document. These terms use large amounts of index storage and cause poor matches (although this is not usually a problem because of the use of multiple query terms for matching purposes).
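Applying a stop list during indexing is a simple set-membership filter. A minimal sketch (the tiny stop list here is illustrative, not one of the published lists):

```python
# A toy stop list; published lists such as Francis & Kucera's contain
# hundreds of entries.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "that", "it"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear on the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the use of a stop list in automatic indexing".split()
print(remove_stopwords(tokens))  # ['use', 'stop', 'list', 'automatic', 'indexing']
```

Because the filter runs once per token at indexing time, even a list of several hundred stopwords adds negligible cost while removing a large fraction of the postings.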
One commonly used approach to building a stop list is to use one of the many lists generated in the past. Francis & Kucera (1982) produced a stop list of 425 words derived from the Brown corpus, and a list of 250 stopwords was published by van Rijsbergen (1975). These lists contain many of the words that always have a high frequency, such as "a", "and", "the", and "is", but also may contain "fluff" words that may not have a high