IR4873 NIST Interagency Report 4873: Automatic Indexing. Automatic Indexing chapter. Donna Harman, National Institute of Standards and Technology.

presented results from work using an online encyclopedia in which they weighted terms not only globally for an entire document (as in section 3.1) but also locally for a given sentence. In this particular experiment they performed multiple-stage searching in which a short initial query was used to find one or more relevant sections or paragraphs, and then these sections were used to find similar sections using both global and local weighting schemes. Whereas the global weights help increase recall by returning many similar items, the local weights can be used as a filtering operation to improve the precision of the returned set. Further details can be found in a technical report (Salton & Buckley 1990d). This type of approach to searching and term weighting may be particularly suitable for large full-text data collections.

3.6 Using combinations of indexing techniques

All the preceding research efforts had as a basis the combination of various information from the text to improve indexing and searching. The best term weighting schemes discussed in section 3.1 combined different statistical measures of term importance. The section on query expansion dealt with combining information about term co-occurrence to automatically identify better query terms and term weights. The work on multiple word phrases investigated not only how to locate phrases, but also how to correctly combine these phrases with single terms. Feature selection involves combining information from the text to help better select which features to index, and the advanced term weighting techniques combine term weights at two granularity levels to improve precision.
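The two-level weighting mentioned above, with section-level (global) similarity favoring recall and sentence-level (local) similarity acting as a precision filter, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the scheme used in the cited experiments: raw term frequencies stand in for the term weights, sentences are split naively on periods, and the function names and the thresholds `global_min` and `local_min` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase words with surrounding punctuation removed (naive tokenizer).
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w]

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_sentences(text):
    # Assumption: sentences end with a period.
    return [s for s in text.split(".") if s.strip()]

def global_similarity(section_a, section_b):
    # Similarity computed over whole sections.
    return cosine(Counter(tokenize(section_a)), Counter(tokenize(section_b)))

def local_similarity(section_a, section_b):
    # Best similarity between any sentence of one section and any of the other.
    scores = [
        cosine(Counter(tokenize(sa)), Counter(tokenize(sb)))
        for sa in split_sentences(section_a)
        for sb in split_sentences(section_b)
    ]
    return max(scores, default=0.0)

def two_level_match(seed, candidates, global_min=0.3, local_min=0.5):
    # Keep candidates that are globally similar to the seed section,
    # then filter out those with no closely matching sentence.
    kept = []
    for name, text in candidates.items():
        g = global_similarity(seed, text)
        if g < global_min:
            continue
        l = local_similarity(seed, text)
        if l >= local_min:
            kept.append((name, g, l))
    return sorted(kept, key=lambda item: item[1], reverse=True)

seed = "cats chase mice. dogs chase cats."
candidates = {
    "near": "dogs chase cats. birds sing.",
    "scattered": "cats sleep. mice hide. dogs bark. birds chase.",
    "unrelated": "birds sing. fish swim.",
}
matches = two_level_match(seed, candidates)
```

In this toy data the "scattered" section shares much of the seed's vocabulary, so it passes the global test, but no single sentence in it resembles a seed sentence, so the local filter removes it.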
Other more explicit combination techniques have been tried, from simple user weighting of terms (to be combined with the statistical term weighting), to combining database attributes with free text (Deogun & Raghavan 1988), to more elaborate combining of concepts such as citations, attributes, and data into the vector space model (Fox et al. 1988). Results have generally shown improvements in performance, even for small test collections.

This combination of various sources of information can be extended to combining various types of indexing (such as manual or automatic), various types of queries (such as using or not using Boolean connectors), or various types of searching (such as cluster searching vs document searching). It has been shown (Katzer et al. 1982) that different indexing or searching methods can produce comparable results, but with little overlap between the sets of relevant documents. Clearly it would be ideal to combine these methods, but the method for combining the completely different approaches to indexing and searching is not easily apparent.

A new model, the inference network (Turtle and Croft 1991), is designed specifically for this task of combining evidence or probabilities from all these different methods. This network consists of term nodes, document nodes, and query nodes, connected by links with probabilistic weighting factors, and can be used to try multiple ways of combining information from these nodes to form a list of documents ranked in order of likely relevance to a user's need. Turtle & Croft show how this model can be used to represent most of the basic indexing and searching techniques, and discuss how the generation of this model provides the scope for a thorough investigation of how to perform complex combinations of techniques.
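As a rough illustration of what combining evidence from different methods can mean, the sketch below merges per-method relevance estimates for each document with a noisy-OR rule. This is not the inference network of Turtle and Croft; the noisy-OR rule is just one simple combination rule chosen for the example, and the method names, document identifiers, and belief values are all made up.

```python
def combine_noisy_or(beliefs):
    # Probability that at least one source supports relevance,
    # treating the sources as independent (an assumption of this sketch).
    p_none = 1.0
    for p in beliefs:
        p_none *= 1.0 - p
    return 1.0 - p_none

def rank_by_combined_evidence(evidence):
    # evidence maps a method name to {document id: belief in [0, 1]}.
    # A document a method did not retrieve contributes a belief of 0.
    docs = set()
    for scores in evidence.values():
        docs.update(scores)
    combined = {
        doc: combine_noisy_or([scores.get(doc, 0.0) for scores in evidence.values()])
        for doc in docs
    }
    # Highest combined belief first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical beliefs from three methods that retrieve different documents.
evidence = {
    "automatic_terms": {"d1": 0.6, "d2": 0.3},
    "manual_index": {"d2": 0.5, "d3": 0.4},
    "boolean_query": {"d1": 0.2},
}
ranking = rank_by_combined_evidence(evidence)
```

Because the methods retrieve different documents, the merged ranking covers all three, and a document supported by two methods can outrank one that a single method scored higher. The inference network model itself is more general than this sketch.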
This type of representation can be viewed as a very advanced indexing method, and may prove important in handling large full-text data.

3.7 Summary

Whereas the traditional automatic single term indexing described in section 2 enables reasonable searching of large full-text documents, the more advanced techniques discussed in this section may all prove important in raising retrieval performance beyond a mediocre level. It is critical that research continue into these advanced techniques, and others like them, and that as they become proven methodologies, they be accepted as standard automatic indexing techniques by the information retrieval community as a whole.

REFERENCES

Burgin R. and Dillon M. (1992). Improving Disambiguation in FASIT. Journal of the American Society for Information Science, 43(2), 101-114.

Church K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In: Proceedings of the Second Conference on Applied Natural Language Processing; 1988, 136-143; Austin, Texas.

Cleverdon C.W. and Keen E.M. (1966). Factors Determining the Performance of Indexing Systems, Vol.1: