NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
In terms of performance improvements, research has shown that on average results were not improved by
using a stemmer. However, system performance must reflect a user's expectations, and the use of a stemmer (par-
ticularly the S stemmer) is intuitive to many users. The OKAPI project (Walker & Jones 1987) did extensive
work on improving retrieval in online catalogs, and strongly recommended using a "weak" stemmer at all times,
as the "weak" stemmer (removal of plurals, "ed" and "ing") seldom hurt performance, but provided significant
improvement. They found drops in precision for some queries using a "strong" stemmer (a variation of the
Porter algorithm), and therefore recommended the use of a "strong" stemmer only when no matches were found.
One method of selective stemming is the availability of truncation in many online commercial retrieval systems.
However, Frakes (1984) found that automatic stemming performed as well as truncation by an experienced user,
and most user studies show little actual use of truncation. Given today's retrieval speed and the ability for user
interaction, a realistic approach for online retrieval would be the automatic use of a stemmer, using an algorithm
like Porter or Lovins, but providing the ability to keep a term from being stemmed (the inverse of truncation).
If a user found that a term in the stemmed query produced too many nonrelevant documents, the query could be
resubmitted with that term marked for no stemming. In this manner, users would have full advantage of stemming,
but would be able to improve the results of those queries hurt by stemming.
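The approach described above can be sketched in a few lines of Python. The particular ending rules and the `no_stem` marking are illustrative assumptions in the spirit of a "weak" stemmer, not the OKAPI or Frakes implementations:

```python
def weak_stem(word):
    """A minimal "weak" stemmer: strip simple plural, "ed", and "ing"
    endings, leaving short words untouched.  The length thresholds and
    ending list here are illustrative, not a published algorithm."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ing"):
        return w[:-3]
    if len(w) > 3 and w.endswith("ies"):
        return w[:-3] + "y"
    if len(w) > 3 and w.endswith("ed"):
        return w[:-2]
    if len(w) > 3 and w.endswith("es"):
        return w[:-2]
    if len(w) > 2 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

def stem_query(terms, no_stem=()):
    """Stem every query term except those the user has marked for no
    stemming -- the inverse of truncation described in the text."""
    return [t if t in no_stem else weak_stem(t) for t in terms]
```

A query hurt by stemming could then be resubmitted as, e.g., `stem_query(["terms", "news"], no_stem=("news",))`, stemming "terms" to "term" while leaving "news" intact.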
3. Advanced automatic indexing techniques
The basic index terms produced by the methods discussed in section 2 can be used "as is", with Boolean
connectors to combine terms, or a single term may be used for simple searches. However, researchers in infor-
mation retrieval have been developing more complex automatic indexing techniques for over thirty years, with
varying degrees of success in experiments with small test collections.
Some of these techniques (such as the term weighting discussed in section 3.1) are clearly successful and are
likely to scale easily into large full-text documents. Other techniques, such as the query expansion techniques
described in section 3.2, do well on small test collections, but may need additional experimentation when used
in large full-text collections. The added discrimination provided by using phrases as indexing terms rather than
only single terms is discussed in section 3.3. In general the use of phrases has not been successful in small test
collections, but is likely to become more useful, or even critical, in large full-text documents. Large full-text
collections may need better term discrimination measures, and some recent experiments in selecting better indexing
features or in providing more advanced term weighting are described in sections 3.4 and 3.5. Finally, the
notion of combining evidence from multiple types of document indexing is presented in section 3.6.
3.1 Term weighting
Although terms produced by automatic indexing can be used without weights, they also offer the opportunity
for automatic term weighting. This weighting is essential to all systems doing statistical or probabilistic ranking.
Many of the commercial systems provide an ability to rank documents based on the number of terms matching
between the query and the document, but find that users do not select this option often because of its poor
performance. There are several reasons for this poor performance:
1. There is no technique for resolving ties. If there are three words in a query, it may be that only a few
documents match all three words, but many will match two terms, and these documents are essentially
unranked with respect to each other.
2. There is no allowance for word importance within a text collection. A query such as "term weighting in
information retrieval" could return a single document containing all four non-common words, and then an
unranked list of documents containing the two words "term" and "weighting" or "information" and
"retrieval", all in random order. This could mean that the possibly 10 documents containing "term" and
"weighting" are buried in 500 documents containing "information" and "retrieval".
3. There is no allowance for word importance within a document. Looking again at the query "term weighting
in information retrieval", the correct order of the documents containing "term" and "weighting" would
be by frequency of "weighting" within a document, so that the highest ranked document contains multiple
instances of "weighting", not just a single instance.
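These failure modes are easy to reproduce on a toy collection. In the sketch below (the documents and function names are hypothetical), simple match counting leaves d1 and d2 tied, while ranking by within-document term frequency, as point 3 suggests, places the document with multiple instances of "weighting" first:

```python
from collections import Counter

docs = {
    "d2": "term weighting in practice",
    "d1": "term weighting weighting weighting in retrieval systems",
    "d3": "information retrieval overview",
}
query = ["term", "weighting"]

def coordination_rank(docs, query):
    """Rank by the number of distinct query terms present in each
    document.  Ties are common and are broken only arbitrarily
    (here, by dictionary insertion order)."""
    return sorted(docs, key=lambda d: -sum(t in docs[d].split() for t in query))

def tf_rank(docs, query):
    """Rank by summed within-document frequency of the query terms,
    so a document with repeated instances of a term rises to the top."""
    scores = {d: sum(Counter(text.split())[t] for t in query)
              for d, text in docs.items()}
    return sorted(scores, key=lambda d: -scores[d])
```

Here `coordination_rank` returns d2 before d1 (both match two terms, so the tie is broken by accident of ordering), whereas `tf_rank` correctly places d1 first because it contains three instances of "weighting".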