NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
paragraphs or within sentences), and within a set proximity of each other. Fagan investigated the use of many
different parameters for selecting these phrases, and then added the phrases as supplemental index terms, i.e. all
single terms were first indexed and then some additional phrases were produced.
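The statistical approach can be sketched as follows. The window size, stopword list, and frequency threshold below are illustrative choices only, not Fagan's actual parameters:

```python
import re
from collections import Counter

# Toy stopword list for illustration; real systems use much larger ones.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that"}

def extract_phrases(text, window=2, min_freq=2):
    """Collect word pairs that co-occur within `window` positions of
    each other, keeping those frequent enough to serve as supplemental
    index phrases alongside the single terms."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    pairs = Counter()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            pairs[(w, words[j])] += 1
    return {p: n for p, n in pairs.items() if n >= min_freq}

text = ("information retrieval systems index text; "
        "information retrieval research studies text indexing")
phrases = extract_phrases(text)
```

Only pairs that recur across the text survive the frequency filter, which is why such methods need a reasonable amount of text to work with.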
Fagan (1987) also examined the use of complete syntactic parsing to generate phrases. The parser generated
syntactic parse trees for each sentence and the phrases were then defined as subtrees of those parse trees that
met certain structural criteria. Salton et al. (1989, 1990b) compared the phrases generated for two book
chapters both by the statistical methods and the syntactic methods and found that both methods generated many
correct phrases, but that the overlap of those phrases was small. Salton et al. (1990c) also tried a syntactic
tagger and bracketer (Church 1988) to identify phrases. The tagger uses statistical methods to produce syntactic
part-of-speech tags, and the bracketer identifies phrases consisting of noun and adjective sequences. This
simpler approach does not require the completion of entire parse trees and seemed to produce as many good
phrases.
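A noun-and-adjective bracketer of this kind can be approximated from part-of-speech tags. The grouping rule below is a simplified sketch using Penn-style tags, not Church's implementation:

```python
def bracket_noun_phrases(tagged):
    """Group maximal runs of adjective (JJ*) and noun (NN*) tags into
    candidate phrases, keeping only runs of two or more words."""
    phrases, run = [], []
    for word, tag in tagged:
        if tag.startswith(("JJ", "NN")):
            run.append(word)
        else:
            if len(run) > 1:
                phrases.append(" ".join(run))
            run = []
    if len(run) > 1:
        phrases.append(" ".join(run))
    return phrases

# Hand-tagged example sentence for illustration.
tagged = [("the", "DT"), ("fast", "JJ"), ("syntactic", "JJ"),
          ("parser", "NN"), ("produced", "VBD"), ("parse", "NN"),
          ("trees", "NNS")]
# bracket_noun_phrases(tagged) → ["fast syntactic parser", "parse trees"]
```

Note that no parse tree is ever built; the bracketer needs only the flat tag sequence, which is what makes the approach so much cheaper than full parsing.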
In general, retrieval experiments that add phrases to single term indexing have not been successful with
small test collections. One reason has been the scarcity of phrases in the text that match phrases in the query.
Lewis & Croft (1990) tried first locating phrases using a chart parser, and then clustering these phrases. The
retrieval used single terms, phrases, and clustered phrases in different combinations. The best performance used
terms, phrases, and clustered phrases as features for retrieval. However, even this performance was not
significantly better than performance using only single terms for the small test collection used.
The current feeling among researchers is that the use of multiple-word phrases will be successful only for
large collections of text. This is partially because of the need for enough text to locate phrases that will be
good features for retrieval. Equally important, the higher precision retrieval offered by phrases may only be
important in the larger full-text retrieval environment. Croft et al. (1991) investigated various ways of both generating and using phrases in retrieval, and although their results on the small CACM test collection were not
significant, the work they are doing on a larger test collection shows impressive results using phrases. It is
likely that the use of phrases for retrieval in large full-text retrieval environments will show significant, and pos-
sibly critical, improvements over single term indexing.
3.4 Feature selection
Another method of improving precision in retrieval from large full-text data is to select indexing features
more carefully. The current approach to automatic indexing generally indexes all stems in a document, elim-
inating only stopwords and possibly numbers. This exhaustive coverage may be important for small documents
such as abstracts or bibliographic records, but using all terms in very large records may weaken the matching
criteria. Ideally one would like to be able to automatically select the single terms or phrases which best
represent a document. Unfortunately this area has attracted little research because of the absence of large full-
text test collections.
Two recent papers address this issue. The first paper (Strzalkowski 1992) described some research using a
statistical retrieval system with some improvements based on natural language techniques. Strzalkowski used a
very fast syntactic parser to parse the text. The phrases found using this parser were then statistically analyzed
and filtered to produce automatically a set of semantic relationships between the words and subphrases. This
highly selective set of phrases was then used to both expand and filter the query. The results on the small
CACM collection showed a significant improvement in performance over the straight statistical methods, and
these techniques clearly will scale up to larger full-text documents.
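The expand-and-filter idea can be illustrated with a toy relation table. The mapping below is invented for illustration; Strzalkowski derived his semantic relationships statistically from the parsed text rather than from a hand-built table:

```python
def expand_query(query_terms, related):
    """Expand a query with related phrases drawn from a precomputed
    table of semantic relationships between words and subphrases."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(related.get(term, []))
    return expanded

# Hypothetical relation table standing in for the statistically
# derived word/subphrase relationships.
related = {"retrieval": ["information retrieval", "document retrieval"]}
expanded = expand_query(["text", "retrieval"], related)
# → ["text", "retrieval", "information retrieval", "document retrieval"]
```

Because the added phrases are highly selective, the expanded query can match more relevant documents without the precision loss that indiscriminate expansion causes.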
The second paper (Lewis 1992) was an investigation into feature selection using a classification test collection. This test collection contains 21,450 Reuters newswires that have been manually classified into 135 topic
descriptions. The goal of this research was to identify what text features (terms, phrases, or phrase clusters)
were important in generating these categories. Best results were obtained for a small number of features (10 -
15), and the paper discusses the best ways to select these features. This type of approach to feature
pruning also needs to be further explored for large full-text collections.
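A minimal form of feature pruning is to score every term and keep only the top few. The relative-document-frequency score below is a simple proxy chosen for illustration, not one of the selection measures Lewis evaluated:

```python
from collections import Counter

def select_features(category_docs, all_docs, k=10):
    """Score each term by its document frequency inside the category
    relative to the whole collection, and keep the top-k terms."""
    def doc_freq(docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc.lower().split()))
        return df
    cat_df, all_df = doc_freq(category_docs), doc_freq(all_docs)
    scores = {t: cat_df[t] / all_df[t] for t in cat_df}
    # Break score ties alphabetically for a deterministic result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Tiny invented collection for illustration.
category = ["wheat prices rise", "wheat exports fall"]
collection = category + ["oil prices rise", "stocks fall"]
top = select_features(category, collection, k=2)
# → ["exports", "wheat"]
```

Terms concentrated in the category score highest, while terms spread across the whole collection are pruned away, which is the intuition behind keeping only 10-15 features per topic.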
3.5 More advanced term weighting techniques
A third approach to increased precision for the larger documents is to use all terms for indexing, but to pro-
vide more sophisticated term weighting methods than those discussed in section 3.1. Salton & Buckley (1991)