NIST Interagency Report 4873: Automatic Indexing
Donna Harman
National Institute of Standards and Technology
3.2 Query expansion
One of the problems found in all information retrieval systems is that relevant documents are missed because
they contain no terms from the query. Although users often do not want to find most of the relevant documents,
sometimes they want to find many more and are willing to examine more documents in the hope of finding
additional relevant ones. However, automatic indexing systems generally do not offer the "higher-level" terms
describing a document that could have been manually assigned, and it is difficult to generate a more exhaustive
search. One way around this difficulty is to provide tools for query expansion. A simple example of such a tool
would be the ability to browse the dictionary or word list for the text collection. Two more sophisticated
techniques would be the use of relevance feedback or the use of an automatically-constructed thesaurus.
Relevance feedback is a technique that allows users to select a few relevant documents and then ask the sys-
tem to use these documents to improve performance, i.e., to retrieve more relevant documents. There has been a
significant amount of research into this method, although there are few user experiments on large test col-
lections. Salton & Buckley (1990) showed that adding relevance feedback to their similarity measure results in
up to 100% improvement for small test collections. Croft (1983) used the relevant and nonrelevant documents
to probabilistically change the term weighting, and in 1990 he extended this work by also expanding queries
using terms in the relevant documents. A similar approach was taken by Harman (1992a), and these results
(again for a small test collection) showed improvements of around 100% in performance. Clearly the use of
relevance judgments to improve performance is important in full-text searching and can supplement the use of
the basic automatically-indexed terms, but the exact methods of using these relevance judgments are still to be
determined for large full-text documents. Possibly their best use is in providing an interactive tool for modify-
ing the query by suggesting new terms. For a survey of the use of relevance feedback in experimental retrieval
systems, including Boolean systems, see Harman (1992b).
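The feedback loop described above is often implemented along the lines of a Rocchio-style reweighting: terms from user-judged relevant documents are folded into the query vector with reduced weight, so unseen terms such as near-synonyms enter the query. The sketch below is a minimal illustration with invented weights (alpha, beta) and toy term vectors; it is not the exact formulation of any of the systems cited here.

```python
from collections import defaultdict

def expand_query(query, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style expansion: keep the original query terms (scaled
    by alpha) and add the average weight of each term in the
    user-judged relevant documents (scaled by beta).

    query: dict mapping term -> weight
    relevant_docs: list of dicts mapping term -> weight
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    return dict(new_query)

# Toy example: "thesaurus" and "cluster" are pulled into the query
# from the relevant documents even though the user never typed them.
query = {"index": 1.0, "automatic": 1.0}
relevant = [{"index": 0.5, "thesaurus": 0.8},
            {"automatic": 0.4, "cluster": 0.6}]
expanded = expand_query(query, relevant)
```

Note that the expanded query still favors the original terms (their weights stay near 1.0), while the borrowed terms enter with smaller weights; this matches the suggestion above that feedback may work best as a tool for *suggesting* new terms rather than replacing the query.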
A different method of query expansion could be the use of a thesaurus. This thesaurus could be used as a
browsing tool, or could be incorporated automatically in some manner. The building of such a thesaurus, how-
ever, is a massive, often domain-dependent task. Some research has been done into automatically building a
thesaurus. Sparck Jones & Jackson (1970) experimented with clustering terms based on the co-occurrence of these
terms in documents. They tried several different clustering techniques and several different methods of using
these clusters on the manually-indexed Cranfield collection. The major results on this small test collection
showed that 1) it is important NOT to cluster high-frequency terms (they become unit clusters), 2) it is important
to create small clusters, and 3) it is better to search using the clusters alone rather than a "mixed mode" of clus-
ters and single terms. Crouch (1988) also generated small clusters of low-frequency terms, but had good results
searching using query terms augmented by thesaurus classes. Careful attention was paid to properly weighting
these additional "terms". It is of course unknown how these results scale up to large full-text collections, but the
concept seems promising enough to encourage further experimentation.
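The clustering idea can be sketched as follows: count document frequencies, exclude high-frequency terms (which, per the results above, would only form unit clusters), and group the remaining low-frequency terms that co-occur in enough documents. The thresholds and the naive single-link grouping below are illustrative assumptions, not the methods used in the cited experiments.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_classes(docs, max_df=2, min_cooccur=2):
    """Group low-frequency terms that co-occur in at least
    min_cooccur documents.

    docs: list of sets of terms. Terms with document frequency
    above max_df are excluded before clustering.
    """
    df = Counter(term for doc in docs for term in doc)
    low_freq = {t for t, n in df.items() if n <= max_df}

    # Count how many documents each low-frequency pair shares.
    cooccur = Counter()
    for doc in docs:
        for a, b in combinations(sorted(doc & low_freq), 2):
            cooccur[(a, b)] += 1

    # Naive single-link grouping: merge pairs above the threshold.
    classes = {t: {t} for t in low_freq}
    for (a, b), n in cooccur.items():
        if n >= min_cooccur:
            merged = classes[a] | classes[b]
            for t in merged:
                classes[t] = merged
    return {frozenset(c) for c in classes.values()}

# "steel" appears in every document, so it is excluded; "grain" and
# "boundary" co-occur twice and fall into one thesaurus class.
docs = [{"grain", "boundary", "steel"},
        {"grain", "boundary", "steel"},
        {"steel", "alloy"}]
classes = cooccurrence_classes(docs)
```

A thesaurus class found this way could then be used either for browsing or to augment query terms, with the weighting caveat noted above for Crouch (1988).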
3.3 The use of multiple-word phrases for indexing
Large full-text collections not only need special query expansion devices to improve recall (the percentage of
total relevant documents retrieved), but also need precision devices to improve their accuracy. One important
precision device is the term weighting discussed in section 3.1. The ability to provide ranked output improves
precision because users are no longer looking at a random ordering of selected documents. However, further
improvement in precision may be necessary for searching in large full-text collections, and one way to get addi-
tional accuracy is to require more stringent matching, such as phrase matching.
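As a simple illustration of why phrase matching is the stricter, higher-precision condition, the hypothetical helper below accepts a document only when the query terms appear adjacently and in order, rather than anywhere in the document:

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True only when the phrase terms appear adjacently and in
    order -- a stricter test than matching each term independently."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = ["query", "expansion", "improves", "recall",
       "in", "large", "collections"]
adjacent = contains_phrase(doc, ["query", "expansion"])
scattered = contains_phrase(doc, ["query", "recall"])
```

A document containing "query" and "recall" far apart matches a single-term search on both words but fails the phrase test, which is exactly the kind of false match that phrase indexing is meant to suppress.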
Phrase matching has been used in information retrieval experiments for many years, but has recently gotten
more attention because of improvements in natural language technology. The initial phrase matching used
templates (Weiss 1970) rather than deep natural language parsing algorithms. The FASIT system (Dillon &
Gray 1983; Burgin & Dillon 1992) used template matching by creating a dictionary of syntactic category pat-
terns and using this dictionary to locate phrases. They assigned syntactic categories by using a suffix dictionary
and exception list. The phrases detected by this system were normalized and then merged into concept groups
for the final matching with queries.
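A toy version of this template approach might look like the following. The suffix rules, exception list, and category patterns here are all invented for illustration; they are not the actual FASIT dictionaries, only the same three-step shape: tag words from suffixes (with an exception list), then keep adjacent pairs whose category pattern is in the template dictionary.

```python
# Invented suffix-to-category rules (N = noun, V = verb, A = adjective).
SUFFIX_CATEGORIES = [("tion", "N"), ("ness", "N"),
                     ("ing", "V"), ("ical", "A"), ("al", "A")]
EXCEPTIONS = {"phrase": "N", "index": "N"}   # exception list
TEMPLATES = {("A", "N"), ("N", "N")}          # patterns kept as phrases

def category(word):
    """Assign a crude syntactic category from the exception list
    or the first matching suffix, defaulting to noun."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, cat in SUFFIX_CATEGORIES:
        if word.endswith(suffix):
            return cat
    return "N"

def detect_phrases(tokens):
    """Return adjacent word pairs whose category pattern matches
    a template."""
    cats = [category(t) for t in tokens]
    return [(tokens[i], tokens[i + 1])
            for i in range(len(tokens) - 1)
            if (cats[i], cats[i + 1]) in TEMPLATES]

found = detect_phrases(["statistical", "phrase", "detection"])
```

In a full system the detected phrases would then be normalized and merged into concept groups, as described above, before matching against queries.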
A second type of phrase detection method, based purely on statistics, was investigated by Fagan (1987,
1989). This type of system relies on statistical co-occurrences of terms, as did the automatic thesaurus building
described in section 3.2, but requires that these terms co-occur in more limited domains (such as within