NIST Interagency Report 4873: Automatic Indexing
Donna Harman, National Institute of Standards and Technology

paragraphs or within sentences), and within a set proximity of each other. Fagan investigated the use of many different parameters for selecting these phrases, and then added the phrases as supplemental index terms, i.e., all single terms were first indexed and then some additional phrases were produced. Fagan (1987) also examined the use of complete syntactic parsing to generate phrases. The parser generated syntactic parse trees for each sentence, and the phrases were then defined as subtrees of those parse trees that met certain structural criteria. Salton et al. (1989, 1990b) compared the phrases generated for two book chapters by both the statistical methods and the syntactic methods and found that both methods generated many correct phrases, but that the overlap of those phrases was small. Salton et al. (1990c) also tried a syntactic tagger and bracketer (Church 1988) to identify phrases. The tagger uses statistical methods to produce syntactic part-of-speech tags, and the bracketer identifies phrases consisting of noun and adjective sequences. This simpler approach does not require the construction of complete parse trees and seemed to produce as many good phrases. In general, retrieval experiments that add phrases to single-term indexing have not been successful with small test collections. One reason has been the scarcity of phrases in the text that match phrases in the query. Lewis & Croft (1990) tried first locating phrases using a chart parser, and then clustering these phrases. The retrieval used single terms, phrases, and clustered phrases in different combinations. The best performance used terms, phrases, and clustered phrases as features for retrieval. However, even this performance was not significantly better than performance using only single terms for the small test collection used.
The current feeling among researchers is that the use of multiple-word phrases will be successful only for large collections of text. This is partially because of the need for enough text to locate phrases that will be good features for retrieval. Equally important, the higher-precision retrieval offered by phrases may only be important in the larger full-text retrieval environment. Croft et al. (1991) investigated various ways of both generating and using phrases in retrieval, and although their results on the small CACM test collection were not significant, the work they are doing on a larger test collection shows impressive results using phrases. It is likely that the use of phrases for retrieval in large full-text retrieval environments will show significant, and possibly critical, improvements over single-term indexing.

3.4 Feature selection

Another method of improving precision in retrieval from large full-text data is to select indexing features more carefully. The current approach to automatic indexing generally indexes all stems in a document, eliminating only stopwords and possibly numbers. This exhaustive coverage may be important for small documents such as abstracts or bibliographic records, but using all terms in very large records may weaken the matching criteria. Ideally one would like to be able to automatically select the single terms or phrases which best represent a document. Unfortunately this area has attracted little research because of the absence of large full-text test collections. Two recent papers address this issue. The first paper (Strzalkowski 1992) described some research using a statistical retrieval system with some improvements based on natural language techniques. Strzalkowski used a very fast syntactic parser to parse the text. The phrases found using this parser were then statistically analyzed and filtered to automatically produce a set of semantic relationships between the words and subphrases.
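Strzalkowski's statistical analysis is considerably more elaborate, but the core idea of scoring parser-derived word pairs and keeping only those that are both frequent and strongly associated can be sketched as follows. The head-modifier pairs, the pointwise-mutual-information score, and both thresholds here are illustrative assumptions, not his actual procedure.

```python
import math
from collections import Counter

def filter_phrases(pairs, min_freq=2, min_score=1.0):
    """Keep word pairs that are both frequent and strongly associated,
    scoring association with pointwise mutual information (PMI)."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for p in pairs for w in p)
    n = len(pairs)
    selected = {}
    for (w1, w2), c in pair_counts.items():
        p_pair = c / n
        p_w1 = word_counts[w1] / (2 * n)  # each pair contributes two words
        p_w2 = word_counts[w2] / (2 * n)
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        if c >= min_freq and pmi >= min_score:
            selected[(w1, w2)] = pmi
    return selected

# hypothetical head-modifier pairs, as a parser might emit them
pairs = ([("retrieval", "information")] * 4
         + [("retrieval", "text")]
         + [("system", "information")])
selected = filter_phrases(pairs)
print(sorted(selected))
```

The frequency cutoff matters because PMI alone rewards rare pairs; requiring both conditions is what makes the surviving set selective.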
This highly selective set of phrases was then used to both expand and filter the query. The results on the small CACM collection showed a significant improvement in performance over the straight statistical methods, and these techniques clearly will scale up to larger full-text documents. The second paper (Lewis 1992) was an investigation into feature selection using a classification test collection. This test collection contains 21,450 Reuters newswires that have been manually classified into 135 topic descriptions. The goal of this research was to identify what text features (terms, phrases, or phrase clusters) were important in generating these categories. Best results were obtained for a small number of features (10-15), and some discussion is made of the best ways to select these features. This type of approach to feature pruning also needs to be further explored for large full-text collections.

3.5 More advanced term weighting techniques

A third approach to increased precision for the larger documents is to use all terms for indexing, but to provide more sophisticated term weighting methods than those discussed in section 3.1. Salton & Buckley (1991)