NIST Interagency Report 4873: Automatic Indexing
Donna Harman, National Institute of Standards and Technology

3.2 Query expansion

One of the problems found in all information retrieval systems is that relevant documents are missed because they contain no terms from the query. Although users often do not want to find most of the relevant documents, sometimes they want to find many more relevant documents and are willing to examine more documents in hopes of finding more relevant ones. However, automatic indexing systems generally do not offer the "higher-level" terms describing a document that could have been manually assigned, and it is difficult to generate a more exhaustive search. One way around this difficulty is to provide tools for query expansion. A simple example of such a tool would be the ability to browse the dictionary or word list for the text collection. Two more sophisticated techniques would be the use of relevance feedback or the use of an automatically-constructed thesaurus. Relevance feedback is a technique that allows users to select a few relevant documents and then ask the system to use these documents to improve performance, i.e., retrieve more relevant documents. There has been a significant amount of research into using this method, although there are few user experiments on large test collections. Salton & Buckley (1990) showed that adding relevance feedback to their similarity measure results in up to 100% improvement for small test collections. Croft (1983) used the relevant and nonrelevant documents to probabilistically change the term weighting, and in 1990 he extended this work by also expanding queries using terms in the relevant documents. A similar approach was taken by Harman (1992a), and these results (again for a small test collection) showed improvements of around 100% in performance.
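The general idea behind feedback-driven query expansion can be illustrated with a Rocchio-style reweighting sketch. This is not the specific formulation of Salton & Buckley, Croft, or Harman; the mixing weights, the toy term vectors, and the function itself are all illustrative assumptions.

```python
# Sketch of Rocchio-style relevance feedback query expansion.
# The weights (alpha, beta, gamma) and toy documents below are
# illustrative assumptions, not values from any cited experiment.
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the relevant documents and away
    from the nonrelevant ones; returns a new term-weight dict."""
    new_query = Counter()
    for term, w in query.items():
        new_query[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(nonrelevant)
    # Drop terms whose weight fell to zero or below.
    return {t: w for t, w in new_query.items() if w > 0}

query = {"indexing": 1.0}
relevant = [{"indexing": 0.5, "thesaurus": 0.8}]
nonrelevant = [{"compiler": 0.9}]
expanded = rocchio(query, relevant, nonrelevant)
```

Note how the expanded query now contains "thesaurus", a term absent from the original query but present in a judged-relevant document; this is exactly the mechanism by which feedback retrieves relevant documents that share no terms with the initial query.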
Clearly the use of relevance judgments to improve performance is important in full-text searching and can supplement the use of the basic automatically-indexed terms, but the exact methods of using these relevance judgments are still to be determined for large full-text documents. Possibly their best use is in providing an interactive tool for modifying the query by suggesting new terms. For a survey of the use of relevance feedback in experimental retrieval systems, including Boolean systems, see Harman (1992b). A different method of query expansion could be the use of a thesaurus. This thesaurus could be used as a browsing tool, or could be incorporated automatically in some manner. The building of such a thesaurus, however, is a massive, often domain-dependent task. Some research has been done into automatically building a thesaurus. Sparck Jones & Jackson (1970) experimented with clustering terms based on co-occurrence of these terms in documents. They tried several different clustering techniques and several different methods of using these clusters on the manually-indexed Cranfield collection. The major results on this small test collection showed that 1) it is important NOT to cluster high frequency terms (they became unit clusters), 2) it is important to create small clusters, and 3) it is better to search using the clusters alone rather than a "mixed-mode" of clusters and single terms. Crouch (1988) also generated small clusters of low frequency terms, but had good results searching using query terms augmented by thesaurus classes. Careful attention was paid to properly weighting these additional "terms". It is of course unknown how these results scale up to large full-text collections, but the concept seems promising enough to encourage further experimentation.
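A minimal sketch of co-occurrence-based thesaurus construction follows. It reflects two of the findings above (exclude high-frequency terms, keep clusters small, here just term pairs), but the document-frequency cutoff, the co-occurrence threshold, and the toy corpus are illustrative assumptions rather than the clustering methods actually tested by Sparck Jones & Jackson or Crouch.

```python
# Sketch of building small thesaurus classes from term co-occurrence.
# High-document-frequency terms are excluded, and "clusters" are kept
# minimal (term pairs), loosely following the reported findings; the
# cutoffs and corpus are illustrative, not from the cited experiments.
from collections import defaultdict
from itertools import combinations

def cooccurrence_pairs(docs, max_df=2, min_cooc=2):
    """Return pairs of low-frequency terms that co-occur in at
    least min_cooc documents."""
    df = defaultdict(int)
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    # Exclude high-frequency terms (they became unit clusters).
    candidates = {t for t, n in df.items() if n <= max_df}
    cooc = defaultdict(int)
    for doc in docs:
        for pair in combinations(sorted(candidates & set(doc)), 2):
            cooc[pair] += 1
    return {pair for pair, n in cooc.items() if n >= min_cooc}

docs = [
    ["term", "weight", "query"],
    ["term", "weight", "rank"],
    ["query", "rank", "the"],
    ["the", "a"],
    ["the", "a"],
]
classes = cooccurrence_pairs(docs)
```

Here "term" and "weight" co-occur in two documents and form a thesaurus class, while the high-frequency term "the" is excluded from clustering altogether. At query time such classes could augment the original query terms, as in Crouch's experiments, with appropriate down-weighting of the added "terms".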
3.3 The use of multiple-word phrases for indexing

Large full-text collections not only need special query expansion devices to improve recall (the percentage of total relevant documents retrieved), but also need precision devices to improve their accuracy. One important precision device is the term weighting discussed in section 3.1. The ability to provide ranked output improves precision because users are no longer looking at a random ordering of selected documents. However, further improvement in precision may be necessary for searching in large full-text collections, and one way to get additional accuracy is to require more stringent matching, such as phrase matching. Phrase matching has been used in experiments in information retrieval for many years, but has recently gotten more attention because of improvements in natural language technology. The initial phrase matching used templates (Weiss 1970) rather than deep natural language parsing algorithms. The FASIT system (Dillon & Gray 1983; Burgin & Dillon 1992) used template matching by creating a dictionary of syntactic category patterns and using this dictionary to locate phrases. They assigned syntactic categories by using a suffix dictionary and exception list. The phrases detected by this system were normalized and then merged into concept groups for the final matching with queries. A second type of phrase detection method that is based purely on statistics was investigated by Fagan (1987, 1989). This type of system relies on statistical co-occurrences of terms, as did the automatic thesaurus building described in section 3.2, but requires that these terms co-occur in more limited domains (such as within