NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
In terms of performance improvements, research has shown that on average results were not improved by
using a stemmer. However, system performance must reflect a user's expectations, and the use of a stemmer (par-
ticularly the S stemmer) is intuitive to many users. The OKAPI project (Walker & Jones 1987) did extensive
work on improving retrieval in online catalogs, and strongly recommended using a "weak" stemmer at all times,
as the "weak" stemmer (removal of plurals, "ed" and "ing") seldom hurt performance, but provided significant
improvement. They found drops in precision for some queries using a "strong" stemmer (a variation of the
Porter algorithm), and therefore recommended the use of a "strong" stemmer only when no matches were found.
One method of selective stemming is the availability of truncation in many online commercial retrieval systems.
However, Frakes (1984) found that automatic stemming performed as well as truncation by an experienced user,
and most user studies show little actual use of truncation. Given today's retrieval speed and the ability for user
interaction, a realistic approach for online retrieval would be the automatic use of a stemmer, using an algorithm
like Porter or Lovins, but providing the ability to keep a term from being stemmed (the inverse of truncation).
If a user found that a term in the stemmed query produced too many nonrelevant documents, the query could be
resubmitted with that term marked for no stemming. In this manner, users would have full advantage of stemming,
but would be able to improve the results of those queries hurt by stemming.
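The approach described above can be sketched in a few lines of Python. The particular ending rules and the `no_stem` marking are illustrative assumptions in the spirit of a "weak" stemmer, not the OKAPI or Frakes implementations:

```python
def weak_stem(word):
    """A minimal "weak" stemmer: strip simple plural, "ed", and "ing"
    endings, leaving short words untouched.  The length thresholds and
    ending list here are illustrative, not a published algorithm."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ing"):
        return w[:-3]
    if len(w) > 3 and w.endswith("ies"):
        return w[:-3] + "y"
    if len(w) > 3 and w.endswith("ed"):
        return w[:-2]
    if len(w) > 3 and w.endswith("es"):
        return w[:-2]
    if len(w) > 2 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

def stem_query(terms, no_stem=()):
    """Stem every query term except those the user has marked for no
    stemming -- the inverse of truncation described in the text."""
    return [t if t in no_stem else weak_stem(t) for t in terms]
```

A query hurt by stemming could then be resubmitted as, e.g., `stem_query(["terms", "news"], no_stem=("news",))`, stemming "terms" to "term" while leaving "news" intact.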
3. Advanced automatic indexing techniques
The basic index terms produced by the methods discussed in section 2 can be used "as is", with Boolean
connectors to combine terms, or a single term may be used for simple searches. However, researchers in infor-
mation retrieval have been developing more complex automatic indexing techniques for over thirty years, with
varying degrees of success in experiments with small test collections.
Some of these techniques (such as the term weighting discussed in section 3.1) are clearly successful and are
likely to scale easily into large full-text documents. Other techniques, such as the query expansion techniques
described in section 3.2, do well on small test collections, but may need additional experimentation when used
in large full-text collections. The added discrimination provided by using phrases as indexing terms rather than
only single terms is discussed in section 3.3. In general the use of phrases has not been successful in small test
collections, but is likely to become more useful, or even critical, in large full-text documents. Large full-text
collections may need better term discrimination measures, and some recent experiments in selecting better indexing
features or in providing more advanced term weighting are described in sections 3.4 and 3.5. Finally, the
notion of combining evidence from multiple types of document indexing is presented in section 3.6.
3.1 Term weighting
Although terms produced by automatic indexing can be used without weights, they also offer the opportunity
for automatic term weighting. This weighting is essential to all systems doing statistical or probabilistic ranking.
Many of the commercial systems provide an ability to rank documents based on the number of terms matching
between the query and the document, but find that users do not select this option often because of its poor
performance. There are several reasons for this poor performance:
1. There is no technique for resolving ties. If there are three words in a query, it may be that only a few
documents match all three words, but many will match two terms, and these documents are essentially
unranked with respect to each other.
2. There is no allowance for word importance within a text collection. A query such as "term weighting in
information retrieval" could return a single document containing all four non-common words, and then an
unranked list of documents containing the two words "term" and "weighting" or "information" and
"retrieval", all in random order. This could mean that the possibly 10 documents containing "term" and
"weighting" are buried in 500 documents containing "information" and "retrieval".
3. There is no allowance for word importance within a document. Looking again at the query "term weighting
in information retrieval", the correct order of the documents containing "term" and "weighting" would
be by frequency of "weighting" within a document, so that the highest ranked document contains multiple
instances of "weighting", not just a single instance.
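These failure modes are easy to reproduce on a toy collection. In the sketch below (the documents and function names are hypothetical), simple match counting leaves d1 and d2 tied, while ranking by within-document term frequency, as point 3 suggests, places the document with multiple instances of "weighting" first:

```python
from collections import Counter

docs = {
    "d2": "term weighting in practice",
    "d1": "term weighting weighting weighting in retrieval systems",
    "d3": "information retrieval overview",
}
query = ["term", "weighting"]

def coordination_rank(docs, query):
    """Rank by the number of distinct query terms present in each
    document.  Ties are common and are broken only arbitrarily
    (here, by dictionary insertion order)."""
    return sorted(docs, key=lambda d: -sum(t in docs[d].split() for t in query))

def tf_rank(docs, query):
    """Rank by summed within-document frequency of the query terms,
    so a document with repeated instances of a term rises to the top."""
    scores = {d: sum(Counter(text.split())[t] for t in query)
              for d, text in docs.items()}
    return sorted(scores, key=lambda d: -scores[d])
```

Here `coordination_rank` returns d2 before d1 (both match two terms, so the tie is broken by accident of ordering), whereas `tf_rank` correctly places d1 first because it contains three instances of "weighting".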