IR4873
NIST Interagency Report 4873: Automatic Indexing
Automatic Indexing
chapter
Donna Harman
National Institute of Standards and Technology
presented results from work using an online encyclopedia in which they weighted terms not only globally for an
entire document (as in section 3.1), but also locally for a given sentence. In this particular experiment they
performed multiple-stage searching in which a short initial query was used to find one or more relevant sections or
paragraphs, and then these sections were used to find similar sections using both global and local weighting
schemes. Whereas the global weights help increase recall by returning many similar items, the local weights
can be used as a filtering operation to improve the precision of the returned set. Further details can be found in
a technical report (Salton & Buckley 1990d). This type of approach to searching and term weighting may be
particularly suitable for large full-text data collections.
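
The two-level idea above can be illustrated with a minimal sketch. It is not the procedure from the cited technical report: the raw term-frequency vectors, the cosine measure, the sentence splitter, and the two thresholds (global_min, local_min) are all assumptions chosen for illustration. A candidate section is first admitted by its whole-text (global) similarity to a seed section, then kept only if at least one pair of sentences is also similar (local).

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic tokens; a stand-in for real indexing."""
    return re.findall(r"[a-z]+", text.lower())


def sentences(text):
    """Naive sentence splitter (assumption: ., ! and ? end sentences)."""
    return [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def global_sim(a, b):
    """Similarity of the two texts taken as wholes."""
    return cosine(Counter(tokens(a)), Counter(tokens(b)))


def local_sim(a, b):
    """Best similarity over all sentence pairs of the two texts."""
    return max(
        (cosine(Counter(tokens(x)), Counter(tokens(y)))
         for x in sentences(a) for y in sentences(b)),
        default=0.0,
    )


def two_level_match(seed, candidates, global_min=0.2, local_min=0.5):
    """Admit by global similarity, then filter by local similarity.

    Both thresholds are hypothetical values, not taken from the source.
    """
    hits = []
    for cand in candidates:
        g = global_sim(seed, cand)
        if g < global_min:
            continue  # not similar enough overall
        loc = local_sim(seed, cand)
        if loc >= local_min:
            hits.append((cand, g, loc))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

In this sketch a candidate that shares many scattered terms with the seed passes the global test but is dropped by the local test unless some sentence pair matches closely, which is the filtering role the paragraph attributes to local weights.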
3.6 Using combinations of indexing techniques
All the preceding research efforts had as a basis the combination of various information from the text to
improve indexing and searching. The best term weighting schemes discussed in section 3.1 combined different
statistical measures of term importance. The section on query expansion dealt with combining information
about term co-occurrence to automatically identify better query terms and term weights. The work on multiple
word phrases investigated not only how to locate phrases, but also how to correctly combine these phrases with single
terms. Feature selection involves combining information from the text to help better select which features to
index, and the advanced term weighting techniques combine term weights at two granularity levels to improve
precision.
Other more explicit combination techniques have been tried, from simple user weighting of terms (to be
combined with the statistical term weighting), to combining database attributes with free text (Deogun &
Raghavan 1988), to more elaborate combining of concepts such as citations, attributes, and data into the vector
space model (Fox et al. 1988). Results have generally shown improvements in performance, even for small test
collections. This combination of various sources of information can be extended to combining various types of
indexing (such as manual or automatic), various types of queries (such as using or not using Boolean
connectors), or various types of searching (such as cluster searching vs. document searching). It has been shown
(Katzer et al. 1982) that different indexing or searching methods can produce comparable results, but with little
overlap between the sets of relevant documents. Clearly it would be ideal to combine these methods, but the
method for combining the completely different approaches to indexing and searching is not easily apparent.
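
The text does not prescribe a combination method, so the following is only a naive illustration of the problem. It measures the overlap between the document sets returned by two methods and merges them by summing min-max normalized scores. The normalization, the summation rule, and the function names are assumptions made for this sketch; none of them comes from the studies cited above.

```python
def overlap(run_a, run_b):
    """Jaccard overlap between the document sets of two runs."""
    a, b = set(run_a), set(run_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    # If all scores are equal, treat every document as a top hit.
    return {d: ((s - lo) / span if span else 1.0) for d, s in scores.items()}


def fuse(*runs):
    """Rank documents by the sum of their normalized scores across runs.

    A document missing from a run contributes nothing for that run.
    Ties are broken by document id to keep the output deterministic.
    """
    total = {}
    for run in runs:
        for doc, score in normalize(run).items():
            total[doc] = total.get(doc, 0.0) + score
    return sorted(total, key=lambda d: (-total[d], d))
```

For example, with hypothetical runs {"d1": 3.0, "d2": 1.0, "d3": 2.0} and {"d3": 0.9, "d4": 0.5, "d5": 0.1}, only d3 is common to both, and it rises to the top of the fused list because it receives evidence from both methods. Such a rule ignores how differently the two methods produce their scores, which is part of why a principled combination is hard.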
A new model, the inference network (Turtle and Croft 1991), is designed specifically for this task of
combining evidence or probabilities from all these different methods. This network consists of term nodes, document
nodes, and query nodes, connected by links with probabilistic weighting factors, and can be used to try
multiple ways of combining information from these nodes to form a list of documents ranked in order of likely
relevance to a user's need. Turtle & Croft show how this model can be used to represent most of the basic
indexing and searching techniques, and discuss how the generation of this model provides the scope for a
thorough investigation of how to perform complex combinations of techniques. This type of representation can be
viewed as a very advanced indexing method, and may prove important in handling large full-text data.
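
A toy sketch can make the node-and-link structure concrete. It is a simplification under stated assumptions, not the model as published: the link probabilities in TERM_GIVEN_DOC, the DEFAULT_BELIEF value for absent terms, and the three combination rules (product, noisy-or style, and mean) are illustrative choices. Each document node passes a belief to the term nodes it is linked to, and the query node combines the beliefs of its terms into one score per document.

```python
from math import prod

# Hypothetical document -> term link probabilities (made-up values).
TERM_GIVEN_DOC = {
    "d1": {"index": 0.8, "weight": 0.6},
    "d2": {"index": 0.95},
    "d3": {"weight": 0.5, "phrase": 0.7},
}

# Assumed belief in a term when a document has no link to it.
DEFAULT_BELIEF = 0.1


def bel_and(beliefs):
    """All terms must hold: product of the beliefs."""
    return prod(beliefs)


def bel_or(beliefs):
    """At least one term holds: 1 minus the product of the complements."""
    return 1.0 - prod(1.0 - p for p in beliefs)


def bel_sum(beliefs):
    """Unweighted average of the beliefs."""
    return sum(beliefs) / len(beliefs)


def rank(query_terms, combine):
    """Rank documents by the query-node belief under one combination rule."""
    scores = {}
    for doc, links in TERM_GIVEN_DOC.items():
        beliefs = [links.get(t, DEFAULT_BELIEF) for t in query_terms]
        scores[doc] = combine(beliefs)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Swapping the combine argument changes the ranking without changing the network: for the query terms "index" and "weight", bel_and favors d1, which is linked to both terms, while bel_or favors d2, which has one very strong link. This is the sense in which a single representation allows several ways of combining evidence to be tried.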
3.7 Summary
Whereas the traditional automatic single term indexing described in section 2 enables reasonable searching of
large full-text documents, the more advanced techniques discussed in this section may all prove important in
raising the retrieval performance beyond a mediocre level. It is critical that research continue into these
advanced techniques, and others like them, and that as they become proven methodologies, they be accepted as
standard automatic indexing techniques by the information retrieval community as a whole.
REFERENCES
Burgin R. and Dillon M. (1992). Improving Disambiguation in FASIT. Journal of the American Society for
Information Science, 43(2), 101-114.
Church K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In: Proceedings of
the Second Conference on Applied Natural Language Processing; 1988, 136-143; Austin, Texas.
Cleverdon C.W. and Keen E.M. (1966). Factors Determining the Performance of Indexing Systems, Vol.1: