NIST Interagency Report 4873:
Automatic Indexing



                               Automatic Indexing

                                  Donna Harman
                    National Institute of Standards and Technology


1. Introduction
  Vast amounts of text are available online today, including text created for electronic access and text designed
mainly for traditional publishing. This text is not searchable without the ability to do automatic indexing. Yet
the "discovery" that adequate indexing could be done using single terms from the text generally surprised the
library community. As Cyril Cleverdon reported from the Cranfield project (Cleverdon & Keen 1966):

 "Quite the most astonishing and seemingly inexplicable conclusion that asises from the project is that the sin-
 gle term indexing languages are superior to any other type...unless one is prepared to say that the whole test
 conception is so much at fault that the results are completely distorted, there is no course except to auempt to
 explain the results which seem to offend against every canon on which we were trained as librarians."

Today we not only accept these results, but base many of the large commercial online systems on this once-
revolutionary idea. The discovery of automatic indexing coincided with the availability of large computers and
created a major interest in automatically indexing and searching text, such as the work done by H.P. Luhn (1957)
in investigating the use of frequency weights in automatic indexing. Work has continued since then in various
research laboratories and has resulted in more sophisticated automatic indexing methods, using single terms and
using larger chunks of text (such as phrases).
  This paper was written to serve two separate goals. The first goal is to provide a tutorial on single term
indexing of "real-world" text. Therefore section 2 steps through the indexing process, discussing the types of
critical issues that must be resolved during full text indexing in order to provide effective retrieval performance.
Most of these issues are straightforward. However, poor choices of indexing parameters produce systems that
would be considered failures in most applications.
  The second goal is to provide some discussion of advances in automatic indexing beyond the simple single-
term indexing done in most operational retrieval systems. Section 3 discusses many of the techniques being
investigated and provides references for further reading.

2. Automatically producing simple index terms
  This section presents a walk-through of the processing of an online text file to produce a list of index terms
that can be used for searching that file. These terms would be placed in an inverted file, or other data structure,
and an information search could be made against this index using Boolean retrieval operators to combine the
terms. Alternatively some of the more advanced searching methods could use these terms as input to term
weighting algorithms that produce ranked output using statistical techniques.
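
  As a minimal sketch of this arrangement (the function names and the record representation are illustrative
assumptions, not part of any particular system described in this paper), an inverted file can be built and
searched with a Boolean AND as follows:

    from collections import defaultdict

    def build_inverted_file(records):
        # records: iterable of (record_id, list of index terms)
        # returns: term -> sorted list of record ids (a postings list)
        inverted = defaultdict(set)
        for record_id, terms in records:
            for term in terms:
                inverted[term].add(record_id)
        return {term: sorted(ids) for term, ids in inverted.items()}

    def boolean_and(inverted, term1, term2):
        # Boolean AND of two terms is the intersection of their postings lists.
        return sorted(set(inverted.get(term1, [])) & set(inverted.get(term2, [])))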

2.1 What constitutes a record
  The first key decision for any indexing is the choice of record boundaries which identify a searchable unit. A
record could be defined as an entire book, a chapter in the book, a section in that chapter, or even a paragraph.
This decision is critical for effective retrieval, both in the retrieval/display stage and in the search stage. Often
this decision is clearcut. For example, if the application is searching bibliographic records as in an online catalog,
clearly a record is one of the bibliographic records. Similarly, if the application is searching newspaper articles
or newswire stories for particular events, then these articles or stories each becomes a record. The choice
of record size becomes fuzzy, however, as the size of the documents being examined grows larger. If the documents
being searched are long articles such as legal transcriptions of court cases or full journal articles, then the
record might still be the entire document, although this may make display and searching more difficult. However,
if the documents being searched are manuals or textbooks, a record should not be the entire document.
Here the choice should depend on the retrieval and display mechanisms of the particular application. For
example, the application of searching an online manual might have a record defined as the lowest subsection, so
that users find and display very exact subsections of material. If the application is to provide pointers into paper
copies of long articles (such as 100-page+ court cases), it might be reasonable to make each page or small section
a record so that the display could show a one-line sentence with the hits, and give the page number.
  The choice of record size is not only important for display, but also is critical for effective searching. A
record which is too short provides little text for the searching algorithms to use and will cause poor results. Too
large a record, however, may dilute the importance of word matches, and cause many false matches. For these
reasons it would not be reasonable to make a sentence a record, but paragraphs might be fine as records. Alter-
natively it would not be effective to make a very long section a record; it would be better to break it into
smaller subsections.  Further, the choice of record size may also affect the choice of term weighting and
retrieval algorithms (see section 3.1 on term weighting).
  A recent paper (Harman & Candela 1990) shows some possible record size decisions and their consequences.
Three different text collections were involved in user testing of a retrieval system using automatic indexing and
statistical ranking. The first text collection was small (1.6 megabytes) and consisted of a manual organized into
sections and chapters. A record was determined to be equivalent to a paragraph in this manual, because this
appeared to be the most useful record size for the end users. This decision caused many short records (see
Table 1). The second text collection was a legal code book, with sections and subsections. Here the records
were set to be each subsection, again based on user preference. The records were therefore much larger, with
many words occurring multiple times within each record. The third text collection consisted of about 40,000
court cases. A record here was set to be a court case. Table 1 shows some basic statistics on these text collec-
tions. The average number of terms per record includes duplicate terms and is a measure of the record length
rather than the number of unique term occurrences. The average number of postings per term is the average
number of documents containing that term.

                                    TABLE 1
                               Collection Statistics

            Size of collection              1.6 MB     50 MB     806 MB
            Number of records                 2653      6652      38304
            Average number of
              terms per record                  96      1124       3264
            Number of unique terms            5123     25129     243470
            Average postings per term           14        40         88
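
  As an illustration of the paragraph-as-record decision made for the first collection (a sketch only; the
blank-line paragraph convention is an assumption about the text format, not a detail reported in the paper),
records could be produced as follows:

    def split_into_records(text):
        # Treat each blank-line-delimited paragraph as one searchable record.
        paragraphs = [p.strip() for p in text.split("\n\n")]
        return [(i, p) for i, p in enumerate(paragraphs) if p]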

2.2 What constitutes a word and what "words" to index

  The second key decision for any indexing is the choice of what constitutes a word and then which of these
words to index. In manual indexing systems this choice is easily made by the human indexer. However, for
automatic indexing it is necessary to define what punctuation should be used as word separators and to define
what "words" to index.
    Normally word separators include all white spaces and all punctuation. However, there are many exceptions
 to this rule, and, depending on the application and the searching software, the methods of handling these excep-
 tions can be crucial to successful retrieval. The following examples illustrate some of the problems encountered
 in typical applications.
 *  Hyphens -- some words can appear in both hyphenated and unhyphenated versions. Sometimes the treatment
    of hyphens is critical to retrieval, such as in chemical names and other normally hyphenated elements
    (glycol-sebacic, F-18, MS-DOS, etc.).
 *  Periods -- periods can appear as a part of a word, such as computer file names (paper.version1), subsection
    titles (1.367A), and in company names.
 *  Slashes, parentheses, underscores -- these can appear as parts of words (OS/2), as parts of section titles
    (367(A)), and as parts of terms in programming languages (doc_no).
 *  Commas -- if numbers are indexed, commas and decimal points become important.
    Once word boundaries are defined, an equally difficult issue is what words or tokens to index. This particu-
 larly applies to the indexing of numbers. If numbers are indexed, the number of unique words can explode
 because there is an unlimited set of unique numbers. As an example, when all numbers in the 50 megabyte text
 collection shown in Table 1 were indexed, the number of unique terms went from 10122 (indexing no numbers)
 to 55486 (indexing all numbers). (The number of unique terms shown in Table 1 includes indexing some
 numbers, as explained later.) Indexing all numbers would have caused an almost doubling of the index size, and
 therefore slower response times. However, not indexing the numbers can lead to major searching problems
 when a number is critical to the query (such as "what were the major breakthroughs in computer speed in
 1986").
    The same problem can apply to the indexing of single characters (other than the words "a" or "I", which are
 discussed in the next section as stopwords). Whereas the number of unique single characters is limited, the
 heavy use of single characters as initials, section labels, etc. can increase the size of the index. Again, however,
 not indexing single characters can lead to searching problems for queries in which these characters are critical
 (such as "sources of vitamin C").
    The solutions to both the problem of word boundaries and what words to index involve compromises.
 Before indexing is started, samples of the text to be indexed, and samples of the types of queries to be run, need
 to be closely examined. This may require a prototype/user testing operation, or may be solved by simply dis-
 cussing the problem with the users. The following examples illustrate some of the possible compromises.
 * The punctuation in the text should be studied, and potential problems identified so that reasonable rules of
   word separation can be found. Often hyphenated words are treated both as separated words and as
   hyphenated words. Other types of punctuation are handled differently based on preceding or succeeding
   characters or spaces.
 * The use of upper and lower case letters also needs to be determined. Usually upper-case letters are changed
   to lower case during indexing, as capitalized words indicating sentence beginnings will not correctly match
   lower case query words. However, if proper nouns are to be treated as special terms, then upper-case letters
   are necessary for proper noun recognition.
 * The indexing of numbers is also heavily application dependent. Dates, section labels, and numbers combined
   with alphabetics may be indexed, and other numbers not indexed. If hyphens can be kept, then some
   number problems are eliminated (such as F-18). In the 50-megabyte text collection shown in Table 1,
   numbers that were part of section labels were kept, and these were distinguished by the punctuation that
   appeared in the number. Some searches were still unsuccessful, however, because of the lack of complete
   number indexing.
 * The indexing of single characters is somewhat easier to handle. Users can check the alphabet and note any
   letters that have particular meaning in their application, and these letters can be indexed.
 Most commercial systems take a conservative approach to these problems. For example, Chemical Abstracts
 Service, ORBIT Search Service, and Mead Data Central's LEXIS/NEXIS systems all recognize numbers and
 words containing digits as index terms, and all are case insensitive. In general they have no special provisions
 for punctuation marks, although Chemical Abstracts Service keeps hyphenated words as single tokens, and the
 other two systems break hyphenated words apart (Fox 1992).
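
  The following sketch illustrates one possible set of such compromises (the exact rules are application
 choices of the kind discussed above, not the rules of any of the commercial systems just named): text is
 lower-cased, hyphenated words are indexed both whole and as their parts, words containing digits are kept,
 pure numbers are dropped, and single letters are dropped unless flagged as meaningful for the application:

    import re

    WORD = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/\-]*[A-Za-z0-9]|[A-Za-z0-9]")

    def tokenize(text, keep_single_letters=frozenset({"c"})):
        tokens = []
        for raw in WORD.findall(text):
            word = raw.lower()
            if word.replace(".", "").replace(",", "").isdigit():
                continue            # drop pure numbers (a searching compromise)
            if len(word) == 1 and word.isalpha() and word not in keep_single_letters:
                continue            # drop single letters unless the application needs them
            tokens.append(word)
            if "-" in word:         # also index the parts of hyphenated words
                tokens.extend(p for p in word.split("-") if len(p) > 1)
        return tokens

 With this rule set, "sources of vitamin C" keeps the letter "c", while a bare year such as "1986" is
 dropped; an application that needs numbers would relax the digit test instead.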

 2.3 Use of stop lists
    Additionally, most automatic indexing techniques work with a stop list that prevents certain high-frequency
 or "fluff" words from being indexed. Francis & Kucera (1982) found that the ten most frequently used words in
 the English language typically account for twenty to thirty percent of the terms in a document. These terms use
 large amounts of index storage and cause poor matches (although this is not usually a problem because of the
 use of multiple query terms for matching purposes).
    One commonly-used approach to building a stop list is to use one of the many lists generated in the past.
 Francis & Kucera (1982) produced a stop list of 425 words derived from the Brown corpus, and a list of 250
 stopwords was published by van Rijsbergen (1975). These lists contain many of the words that always have a
 high frequency, such as "a", "and", "the", and "is", but also may contain "fluff" words that may not have a high
 frequency for some text collections, such as "below", "near", "always", and "that". Note that unlike high fre-
 quency words, "fluff" words do not necessarily hurt retrieval performance, and will not seriously affect storage.
 Often these words become crucial to retrieval, such as in a query "stocks with costs below X dollars", or "res-
 taurants near the harbor".
  A more suitable method of constructing a stop list would be to produce a word frequency listing for the text
to be indexed, and then examine each of the high frequency words. If there is no known importance of a given
word in the application, then that word can be safely placed on a stop list. An example of this procedure is the
work done at the National Institute of Standards and Technology (NIST) with a 25-megabyte collection of the
Wall Street Journal. The top twenty-seven high-frequency words were examined, and four words were removed
as possibly important ("a", "at", "from" and "to"). The remaining twenty-three words then became the stop list.
This was a reduction from a previously-used stop list from the SMART project of 418 words. The shrinkage of
the stop list caused an increase of about 25% in the index storage, but made available for searching an addi-
tional 395 words. This new stop list is shown as Table 2 as an illustration of an abbreviated stop list rather than
as a particularly recommended one.


                               TABLE 2
                          Sample Stop Words
                     an   been   in   or     which
                     and  but    is   that   will
                     are  by     it   the    with
                     as   for    of   this
                     be   have   on   was

  It should be noted that commercial systems are even more conservative in the use of stop lists. ORBIT
Search Service has only eight stop words: "and", "an", "by", "from", "of", "or", "the", and "with" (Fox 1992).
The MEDLARS system has even fewer stop words.
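
  A sketch of the frequency-based procedure described above (the candidate cutoff of twenty-seven words
mirrors the NIST experiment; the set of protected words is whatever the application review produces):

    from collections import Counter

    def build_stop_list(all_tokens, n_candidates=27,
                        keep=frozenset({"a", "at", "from", "to"})):
        # Rank words by collection frequency, take the top candidates,
        # then drop any that the application review marked as important.
        counts = Counter(all_tokens)
        candidates = [w for w, _ in counts.most_common(n_candidates)]
        return [w for w in candidates if w not in keep]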


2.4 Use of suffixing or stemming
  Many information retrieval systems also use suffixing or stemming to replace all indexed words with their
root forms. Different stemming algorithms have been used, including "standard" algorithms, and algorithms built
for a specific domain such as medical English (Pacak & Pratt 1978). For a survey of the various algorithms see Frakes
(1992). Three standard algorithms, an "S" stemming algorithm, the Lovins (1968) algorithm, and the Porter
(1980) algorithm, are most often used, and the following excerpts (Harman 1991) show some of their charac-
teristics.
  The "S stemming algorithm, a basic algorithm conflating singular and plural word forms, is commonly used
for minimal stemming. The rules for a version of this stemmer, shown in Table 3, are only applied to words of
sufficient length (three or more characters), and are applied in an order dependent manner (i.e., the first applica-
ble rule encountered is the only one used). Each rule has three parts: a specification of the qualifying word
ending, such as "ies"; a list of exceptions; and the necessary action.

                            TABLE 3
                          An "S" Stemmer

             IF  a word ends in "ies", but not "eies" or "aies"
               THEN "ies" --> "y"

             IF  a word ends in "es", but not "aes", "ees", or "oes"
               THEN "es" --> "e"

             IF  a word ends in "s", but not "us" or "ss"
               THEN "s" --> NULL

  The Lovins stemmer works similarly, but on a much larger scale. It contains a list of over 260 possible
suffixes, a large exception list, and many cleanup rules. In contrast, the Porter algorithm looks for about 60
suffixes, producing word variant conflation intermediate between a simple singular-plural technique and the Lovins
algorithm. Table 4 shows an example of the differences among the three stemmers. The first column shows the
actual words (full words) from the query. The next three columns show the words that are conflated with the
original words (words that stem to the same root for that stemmer) based on three different stemmers. The
starred terms are the ones that were useful in retrieval for this particular query and are shown only to indicate
the "quasi-random" matching that occurs when matching query terms with terms in relevant documents.

                                      TABLE 4
              Stemmer Differences for query 109 of the Cranfield test collection

            Query -- panels subjected to aerodynamic heating

            FULL WORD       S                PORTER              LOVINS
            *panels         *panel           *panel              *panel
                            *panels          *panels             *panels
            subjected       subjected        subjected           subjected
                                             *subject            *subject
                                             subjective          subjective
                                             subjects            subjects
            *aerodynamic    *aerodynamic     *aerodynamic        *aerodynamic
                            aerodynamics     aerodynamics        aerodynamics
                                             *aerodynamically    *aerodynamically
                                                                 aerodynamicist
            *heating        *heating         *heating            *heating
                                             *heated             *heated
                                                                 *heat
                                                                 heats
                                                                 heater

  Stemming or suffixing is done for two principal reasons: the reduction in index storage required and the
increase in performance due to the use of word variants. The storage savings using stemming is data and imple-
mentation dependent. For small text collections on machines with little storage, a sizable amount of inverted file
storage can be saved using stemming. For the 1.6 megabyte manual shown in Table 1, approximately 20% of
storage was saved by using the Lovins stemmer. Lennon et al. (1981) showed compression percentages for the
Lovins stemmer of 45.8% for the Brown Corpus. However, for the larger text collections normally used in
online retrieval, less storage is saved. The savings was less than 14% for the text of 50 megabytes in Table 1,
probably because this text contains large amounts of numbers, misspellings, proper names, etc. (items that usu-
ally cannot be stemmed).

  In terms of performance improvements, research has shown that on the average results were not improved by
using a stemmer. However, system performance must reflect a user's expectations, and the use of a stemmer (par-
ticularly the S stemmer) is intuitive to many users. The OKAPI project (Walker & Jones 1987) did extensive
work on improving retrieval in online catalogs, and strongly recommended using a "weak" stemmer at all times,
as the "weak" stemmer (removal of plurals, "ed" and "ing") seldom hurt performance, but provided significant
improvement. They found drops in precision for some queries using a "strong" stemmer (a variation of the
Porter algorithm), and therefore recommended the use of a "strong" stemmer only when no matches were found.
One method of selective stemming is the availability of truncation in many online commercial retrieval systems.
However, Frakes (1984) found that automatic stemming performed as well as truncation by an experienced user,
and most user studies show little actual use of truncation. Given today's retrieval speed and the ability for user
interaction, a realistic approach for online retrieval would be the automatic use of a stemmer, using an algorithm
like Porter or Lovins, but providing the ability to keep a term from being stemmed (the inverse of truncation).
If a user found that a term in the stemmed query produced too many nonrelevant documents, the query could be
resubmitted with that term marked for no stemming. In this manner, users would have full advantage of stem-
ming, but would be able to improve the results of those queries hurt by stemming.
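
  A sketch of the query-side mechanism suggested here (the trailing "!" no-stem marker is an illustrative
convention, and the stem argument can be any stemmer, such as the s_stem sketch above):

    def query_terms(query, stem):
        # Stem every query term except those the user marks with a trailing
        # "!" -- the inverse of manual truncation.
        terms = []
        for token in query.lower().split():
            if token.endswith("!"):
                terms.append(token[:-1])   # user asked for this term to be left unstemmed
            else:
                terms.append(stem(token))
        return terms

    # e.g. query_terms("panels! subjected to aerodynamic heating", s_stem)
    # stems every term except "panels".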

3. Advanced automatic indexing techniques
  The basic index terms produced by the methods discussed in section 2 can be used "as is", with Boolean
connectors to combine terms, or a single term may be used for simple searches. However, researchers in infor-
mation retrieval have been developing more complex automatic indexing techniques for over thirty years, and
have had varying degrees of success with these new techniques in experiments with small test collections.
Some of these techniques (such as the term weighting discussed in section 3.1) are clearly successful and are
likely to scale easily into large full-text documents. Other techniques, such as the query expansion techniques
described in section 3.2, do well on small test collections, but may need additional experimentation when used
in large full-text collections. The added discrimination provided by using phrases as indexing terms rather than
only single terms is discussed in section 3.3. In general the use of phrases has not been successful in small test
collections, but is likely to become more useful, or even critical, in large full-text documents. Large full-text
collections may need better term discrimination measures, and some recent experiments in selecting better index-
ing features or in providing more advanced term weighting are described in sections 3.4 and 3.5. Finally, the
notion of combining evidence from multiple types of document indexing is presented in section 3.6.

3.1 Term weighting
  Whereas terms coming from automatic indexing can be used without weights, they offer the opportunity to
do automatic term weighting. This weighting is essential to all systems doing statistical or probabilistic ranking.
Many of the commercial systems provide an ability to rank documents based on the number of terms matching
between the query and the document, but find that users do not select this option often because of poor perfor-
mance. There are several reasons for this poor performance:

 1. There is no technique for resolving ties. If there are three words in a query, it may be that only a few
    documents match all three words, but many will match two terms, and these documents are essentially
    unranked with respect to each other.

 2. There is no allowance for word importance within a text collection. A query such as "term weighting in
   information retrieval" could return a single document containing all four non-common words, and then an
   unranked list of documents containing the two words "term" and "weighting" or "information" and
   "retrieval", all in random order. This could mean that the possibly 10 documents containing "term" and
   "weighting" are buried in 500 documents containing "information" and "retrieval".

 3. There is no allowance for word importance within a document. Looking again at the query "term weight-
    ing in information retrieval", the correct order of the documents containing "term" and "weighting" would
    be by frequency of "weighting" within a document, so that the highest ranked document contains multiple
    instances of "weighting", not just a single instance.

 4. There is no allowance for document length. Whereas this factor is not as important as the first three fac-
   tors, it can be important to normalize ranking for length because otherwise long documents often rank
   higher than short documents, even though the query terms may be more concentrated in the short docu-
   ments.

These problems can be largely avoided by using more complex statistical ranking routines involving proper term
weighting or accurate similarity measures.
  Various experiments in laboratories have been concerned with developing optimal methods of weighting the
terms and optimal methods of measuring the similarity of a document and the query. One of the term weight-
ing measures that has proven very successful is the inverse document frequency weight or IDF (Sparck Jones
1972), which is basically a measure of the scarcity of a term in the text collection. A second measure used is
some function of a term's frequency within a record. These measures are often combined, with appropriate nor-
malization factors for length, to form a single term weight. Statistically-ranked retrieval using this type of term
weighting has a retrieval performance that is significantly better in the laboratory than using no term weighting
(Salton & McGill 1983, Croft 1983, Harman 1986).
  The following recommendations can be made based on this research.

 1. The use of term weighting based on the distribution of a term within a collection usually improves perfor-
   mance, and never hurts performance. The IDF measure has been commonly used for this weighting.

       $IDF_i = \log_2 \frac{N}{n_i} + 1$        (Sparck Jones 1972)

           where N = the number of documents in the collection
                 n_i = the total frequency of term i in the collection

 2. The combination of the within-document frequency with the IDF weight often provides even more improve-
    ment. It is important to normalize the within-document frequency in some manner, both to moderate the
    effect of high frequency terms in a document (i.e. a term appearing 20 times is not 20 times as important as
    one appearing only once) and to compensate for document length. Data containing very short documents
    (such as titles only) should not use weighting for within-document frequency. The following within-
    document frequency measures illustrate correct normalization procedures.

       $cfreq_{ij} = K + (1 - K) \frac{freq_{ij}}{maxfreq_j}$        (Croft 1983)

       $nfreq_{ij} = \frac{\log_2 (freq_{ij} + 1)}{\log_2 length_j}$        (Harman 1986)

           where freq_ij = the frequency of term i in document j
                 maxfreq_j = the maximum frequency of any term in document j
                 K = the constant used to adjust for relative importance of within-document frequency
                 length_j = the number of unique terms in document j

 3. Assuming within-document term frequencies are to be used, several methods can be used for combining
   these with the IDF measure. Both the combining of term weighting and the use of this weighting in simi-
   larity measures between queries and documents are shown.


       $similarity(Q, D_j) = \frac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} (w_{iq})^2 \times \sum_{i=1}^{t} (w_{ij})^2}}$        (Salton & Buckley 1988)

          where  $w_{iq} = \left( 0.5 + \frac{0.5 \, freq_{iq}}{maxfreq_q} \right) \times IDF_i$

          and    $w_{ij} = \frac{freq_{ij} \times IDF_i}{\sqrt{\sum_{i=1}^{t} (freq_{ij} \times IDF_i)^2}}$

            where freq_iq = the frequency of term i in query q
                  maxfreq_q = the maximum frequency of any term in query q
                  IDF_i = the IDF of term i in the entire collection
                  freq_ij = the frequency of term i in document j

      Salton & Buckley suggest reducing the query weighting w_iq to only the within-query frequency
      (freq_iq) for long queries containing multiple occurrences of terms, and using only binary weighting of
      documents (w_ij = 1 or 0) for collections with short documents or collections using controlled vocabulary.

       $similarity_{jq} = \sum_{i=1}^{Q} (C + IDF_i \times cfreq_{ij})$        (Croft 1983)

          where  $cfreq_{ij} = K + (1 - K) \frac{freq_{ij}}{maxfreq_j}$

            where freq_ij = the frequency of term i in document j
                  C = the constant used to adjust for relative importance of all term weighting
                  maxfreq_j = the maximum frequency of any term in document j
                  K = the constant used to adjust for relative importance of within-document frequency


 C should be set to low values (near 0) for automatically indexed collections, and to higher values such as 1
 for manually-indexed collections. K should be set to low values (0.3 was used by Croft) for collections with
 long (35 or more terms) documents, and to higher values (0.5 or higher) for collections with short documents,
 reducing the role of within-document frequency.


       $similarity_{jq} = \sum_{i=1}^{Q} \frac{\log_2 (freq_{ij} + 1) \times IDF_i}{\log_2 length_j}$        (Harman 1986)

          where freq_ij = the frequency of term i in document j
                length_j = the number of unique terms in document j


 4. It can be very useful to add additional weight for document structure, such as higher weightings for terms
    appearing in the title or abstract versus those appearing only in the text. This additional weighting needs to
    be considered with respect to the particular text collection being used for searching.
This section on term weighting presents only a few of the experimental techniques that have been tried. For a
more thorough survey, see Harman (1992a).
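
  As a concrete sketch, the following fragment implements the IDF weight and the Harman (1986) similarity
measure given above (a simplified reading of the formulas; the collection statistics are assumed to be
precomputed):

    import math
    from collections import Counter

    def idf(term, collection_freq, num_docs):
        # IDF_i = log2(N / n_i) + 1
        n_i = collection_freq.get(term, 0)
        return math.log2(num_docs / n_i) + 1 if n_i else 0.0

    def similarity(query_terms, doc_terms, collection_freq, num_docs):
        # similarity_jq = sum over matching terms of
        #                 log2(freq_ij + 1) * IDF_i / log2(length_j)
        freqs = Counter(doc_terms)
        length_j = len(freqs)              # number of unique terms in the document
        if length_j < 2:
            return 0.0                     # avoid log2(1) = 0 in the denominator
        norm = math.log2(length_j)
        return sum(math.log2(freqs[t] + 1) * idf(t, collection_freq, num_docs) / norm
                   for t in set(query_terms) if t in freqs)

Documents are then sorted by this similarity to produce the ranked output.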

3.2 Query expansion
   One of the problems found in all information retrieval systems is that relevant documents are missed because
they contain no terms from the query. Whereas often users do not want to find most of the relevant documents,
sometimes they want to find many more relevant documents and are willing to examine more documents in
hopes of finding more relevant ones. However, the automatic indexing systems generally do not offer the
"higher-level" terms describing a document that could have been manually assigned, and it is difficult to gen-
erate a more exhaustive search. One way around this difficulty is to provide tools for query expansion. A sim-
ple example of such a tool would be the ability to browse the dictionary or word list for the text collection.
Two more sophisticated techniques would be the use of relevance feedback or the use of an automatically-
constructed thesaurus.
   Relevance feedback is a technique that allows users to select a few relevant documents and then ask the sys-
tem to use these documents to improve performance, i.e. retrieve more relevant documents. There has been a
significant amount of research into using this method, although there are few user experiments on large test col-
lections. Salton & Buckley (1990a) showed that adding relevance feedback to their similarity measure results in
up to 100% improvement for small test collections. Croft (1983) used the relevant and nonrelevant documents
to probabilistically change the term weighting, and in 1990 he extended this work by also expanding queries
using terms in the relevant documents (Croft & Das 1990). A similar approach was taken by Harman (1992c) and
these results (again for a small test collection) showed improvements of around 100% in performance. Clearly the use of
relevance judgments to improve performance is important in full-text searching and can supplement the use of
the basic automatically-indexed terms, but the exact methods of using these relevance judgments are still to be
determined for large full-text documents. Possibly their best use is in providing an interactive tool for modify-
ing the query by suggesting new terms. For a survey of the use of relevance feedback in experimental retrieval
systems, including Boolean systems, see Harman (1992b).
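
  A minimal sketch of the expansion idea (patterned loosely after the feedback methods surveyed in the works
cited; the constants and the choice of the most frequent terms are illustrative, not a published algorithm):

    from collections import Counter

    def expand_query(query_terms, relevant_docs, top_n=5, alpha=1.0, beta=0.5):
        # Add the most frequent terms from user-judged relevant documents to
        # the query, weighted lower than the original query terms.
        weights = {t: alpha for t in query_terms}
        feedback = Counter()
        for doc_terms in relevant_docs:
            feedback.update(doc_terms)
        for term, _ in feedback.most_common(top_n):
            weights[term] = weights.get(term, 0.0) + beta
        return weights
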
   A different method of query expansion could be the use of a thesaurus. This thesaurus could be used as a
browsing tool, or could be incorporated automatically in some manner. The building of such a thesaurus, how-
ever, is a massive, often domain-dependent task. Some research has been done into automatically building a
thesaurus. Sparck Jones & Jackson (1970) experimented with clustering terms based on co-occurrence of these
terms in documents. They tried several different clustering techniques and several different methods of using
these clusters on the manually-indexed Cranfield collection. The major results on this small test collection
showed that 1) it is important NOT to cluster high frequency terms (they became unit clusters), 2) it is important
to create small clusters, and 3) it is better to search using the clusters alone rather than a "mixed-mode" of clus-
ters and single terms. Crouch (1988) also generated small clusters of low frequency terms, but had good results
searching using query terms augmented by thesaurus classes. Careful attention was paid to properly weighting
these additional "terms". It is of course unknown how these results scale up to large full-text collections, but the
concept seems promising enough to encourage further experimentation.
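
  A sketch of the co-occurrence counting that underlies such thesaurus construction (the restriction to
low-frequency terms follows the findings above; the thresholds, and the final clustering step, are left as
illustrative parameters):

    from collections import defaultdict

    def cooccurrence_counts(records):
        # Count how often each pair of index terms appears in the same record.
        pair_counts = defaultdict(int)
        for terms in records:
            unique = sorted(set(terms))
            for i, t1 in enumerate(unique):
                for t2 in unique[i + 1:]:
                    pair_counts[(t1, t2)] += 1
        return pair_counts

    def related_pairs(pair_counts, term_freq, max_freq=50, min_cooccur=3):
        # Keep only pairs of low-frequency terms that co-occur often enough;
        # these become candidates for small thesaurus classes.
        return [pair for pair, c in pair_counts.items()
                if c >= min_cooccur and all(term_freq[t] <= max_freq for t in pair)]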

3.3 The use of multiple-word phrases for indexing
   Large full-text collections not only need special query expansion devices to improve recall (the percentage of
total relevant documents retrieved), but also need precision devices to improve their accuracy. One important
precision device is the term weighting discussed in section 3.1. The ability to provide ranked output improves
precision because users are no longer looking at a random ordering of selected documents. However, further
improvement in precision may be necessary for searching in large full-text collections, and one way to get addi-
tional accuracy is to require more stringent matching, such as phrase matching.
   Phrase matching has been used in experiments in information retrieval for many years, but has recently got-
ten more attention because of improvements in natural language technology. The initial phrase matching used
templates (Weiss 1970) rather than deep natural language parsing algorithms. The FASIT system (Dillon &
Gray 1983; Burgin & Dillon 1992) used template matching by creating a dictionary of syntactic category pat-
terns and using this dictionary to locate phrases. They assigned syntactic categories by using a suffix dictionary
and exception list. The phrases detected by this system were normalized and then merged into concept groups
for the final matching with queries.
   A second type of phrase detection method that is based purely on statistics was investigated by Fagan (1987,
1989). This type of system relies on statistical co-occurrences of terms, as did the automatic thesaurus building
described in section 3.2, but requires that these terms co-occur in more limited domains (such as within
paragraphs or within sentences), and within a set proximity of each other. Fagan investigated the use of many
different parameters for selecting these phrases, and then added the phrases as supplemental index terms, i.e. all
single terms were first indexed and then some additional phrases were produced.
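
  A sketch of this purely statistical style of phrase detection (the window size and frequency threshold are
illustrative parameters of the kind Fagan varied, not his actual settings):

    from collections import Counter

    def statistical_phrases(sentences, window=2, min_count=5, stopwords=frozenset()):
        # sentences: iterable of token lists. Count ordered word pairs that
        # co-occur within `window` positions of each other in the same
        # sentence; frequent pairs become candidate supplemental index terms.
        pairs = Counter()
        for sentence in sentences:
            words = [w for w in sentence if w not in stopwords]
            for i, w1 in enumerate(words):
                for w2 in words[i + 1 : i + 1 + window]:
                    pairs[(w1, w2)] += 1
        return [pair for pair, count in pairs.items() if count >= min_count]
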
   Fagan (1987) also examined the use of complete syntactic parsing to generate phrases. The parser generated
syntactic parse trees for each sentence and the phrases were then defined as subtrees of those parse trees that
met certain structural criteria. Salton et al. (1989, 1990b) compared the phrases generated for two book
chapters both by the statistical methods and the syntactic methods and found that both methods generated many
correct phrases, but that the overlap of those phrases was small. Salton et al. (1990c) also tried a syntactic
tagger and bracketer (Church 1988) to identify phrases. The tagger uses statistical methods to produce syntactic
part-of-speech tags, and the bracketer identifies phrases consisting of noun and adjective sequences. This
simpler approach does not require the completion of entire parse trees and seemed to produce as many good
phrases.
   In general, retrieval experiments that add phrases to single term indexing have not been successful with
small test collections. One reason has been the scarcity of phrases in the text that match phrases in the query.
Lewis & Croft (1990) tried first locating phrases using a chart parser, and then clustering these phrases. The
retrieval used single terms, phrases, and clustered phrases in different combinations. The best performance used
terms, phrases, and clustered phrases as features for retrieval. However, even this performance was not
significantly better than performance using only single terms for the small test collection used.
   The current feeling among researchers is that the use of multiple-word phrases will be successful only for
large collections of text. This is partially because of the need for enough text to locate phrases that will be
good features for retrieval. Equally important, the higher precision retrieval offered by phrases may only be
important in the larger full-text retrieval environment. Croft et al. (1991) investigated various ways of both gen-
erating and using phrases in retrieval, and although their results on the small CACM test collection were not
significant, the work they are doing on a larger test collection shows impressive results using phrases. It is
likely that the use of phrases for retrieval in large full-text retrieval environments will show significant, and pos-
sibly critical, improvements over single term indexing.

3.4 Feature selection
   Another method of improving precision in retrieval from large full-text data is to select indexing features
more carefully. The current approach to automatic indexing generally indexes all stems in a document, elim-
inating only stopwords and possibly numbers. This exhaustive coverage may be important for small documents
such as abstracts or bibliographic records, but using all terms in very large records may weaken the matching
criteria. Ideally one would like to be able to automatically select the single terms or phrases which best
represent a document. Unfortunately this area has attracted little research because of the absence of large full-
text test collections.
   Two recent papers address this issue. The first paper (Strzalkowski 1992) described some research using a
statistical retrieval system with some improvements based on natural language techniques. Strzalkowski used a
very fast syntactic parser to parse the text. The phrases found using this parser were then statistically analyzed
and filtered to produce automatically a set of semantic relationships between the words and subphrases. This
highly selective set of phrases was then used to both expand and filter the query. The results on the small
CACM collection showed a significant improvement in performance over the straight statistical methods, and
these techniques clearly will scale up to larger full-text documents.
   The second paper (Lewis 1992) was an investigation into feature selection using a classification test collec-
tion. This test collection contains 21,450 Reuters newswires that have been manually classified into 135 topic
descriptions. The goal of this research was to identify what text features (terms, phrases, or phrase clusters)
were important in generating these categories. Best results were obtained for a small number of features (10-15),
and some discussion is made of the best ways to select these features. This type of approach to feature
pruning also needs to be further explored for large full-text collections.
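
  As an illustration of the flavor of such feature pruning (a simple frequency-ratio score chosen for
brevity, not one of the measures Lewis actually evaluated):

    from collections import Counter

    def select_features(docs_in_category, all_docs, k=15):
        # Score each term by its frequency in the category weighted by the
        # fraction of its collection-wide occurrences that fall inside the
        # category, then keep the top k terms as features.
        cat = Counter(t for doc in docs_in_category for t in doc)
        total = Counter(t for doc in all_docs for t in doc)
        scores = {t: cat[t] * cat[t] / total[t] for t in cat}
        return sorted(scores, key=scores.get, reverse=True)[:k]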

3.5 More advanced term weighting techniques
   A third approach to increased precision for the larger documents is to use all terms for indexing, but to pro-
vide more sophisticated term weighting methods than those discussed in section 3.1. Salton & Buckley (1991)
presented results from work using an online encyclopedia in which they weighted terms both globally for an
entire document (as in section 3.1), but also locally for a given sentence. In this particular experiment they per-
formed multiple-stage searching in which a short initial query was used to find one or more relevant sections or
paragraphs, and then these sections were used to find similar sections using both global and local weighting
schemes. Whereas the global weights help increase the recall by returning many similar items, the local weights
can be used as a filtering operation to improve the precision of the returned set. Further details can be found in
a technical report (Salton & Buckley 1990d). This type of approach to searching and term weighting may be
particularly suitable for large full-text data collections.

3.6 Using combinations of indexing techniques
  All the preceding research efforts had as a basis the combination of various information from the text to
improve indexing and searching. The best term weighting schemes discussed in section 3.1 combined different
statistical measures of term importance. The section on query expansion dealt with combining information
about term co-occurrence to automatically identify better query terms and term weights. The work on multiple-
word phrases investigated how to locate phrases, but also how to correctly combine these phrases with single
terms. Feature selection involves combining information from the text to help better select which features to
index, and the advanced term weighting techniques combine term weights at two granularity levels to improve
precision.
  Other more explicit combination techniques have been tried, from simple user weighting of terms (to be
combined with the statistical term weighting), to combining of database attributes with free text (Deogun &
Raghavan 1988), to more elaborate combining of concepts such as citations, attributes, and data into the vector
space model (Fox et al. 1988). Results have generally shown improvements in performance, even for small test
collections. This combination of various sources of information can be extended to combining various types of
indexing (such as manual or automatic), various types of queries (such as using or not using Boolean connec-
tors), or various types of searching (such as cluster searching vs document searching). It has been shown
(Katzer et al. 1982) that different indexing or searching methods can produce comparable results, but with little
overlap between the sets of relevant documents. Clearly it would be ideal to combine these methods, but the
method for combining the completely different approaches to indexing and searching is not easily apparent.
  A new model, the inference network (Turtle and Croft 1991), is designed specifically for this task of combin-
ing evidence or probabilities from all these different methods. This network consists of term nodes, document
nodes, and query nodes, connected by links with probabilistic weighting factors, and can be used to try multi-
ple ways of combining information from these nodes to form a list of documents ranked in order of likely
relevance to a user's need. Turtle & Croft show how this model can be used to represent most of the basic
indexing and searching techniques, and discuss how the generation of this model provides the scope for a
thorough investigation of how to perform complex combinations of techniques. This type of representation can
be viewed as a very advanced indexing method, and may prove important in handling large full-text data.

3.7 Summary
  Whereas the traditional automatic single term indexing described in section 2 enables reasonable searching of
large full-text documents, the more advanced techniques discussed in this section may all prove important in
raising the retrieval performance beyond a mediocre level. It is critical that research continue into these
advanced techniques, and others like them, and that as they become proven methodologies, they be accepted as
standard automatic indexing techniques by the information retrieval community as a whole.


REFERENCES

Burgin R. and Dillon M. (1992). Improving Disambiguation in FASIT. Journal of the American Society for
Information Science, 43(2), 101-114.

Church K. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In: Proceedings of
the Second Conference on Applied Natural Language Processing; 1988, 136-143; Austin, Texas.

Cleverdon C.W. and Keen E.M. (1966). Factors Determining the Performance of Indexing Systems, Vol.1:
Design, Vol.2: Test Results. Aslib Cranfield Research Project, Cranfield, England, 1966.

Croft W.B. (1983). Experiments with Representation in a Document Retrieval System. Information Technology:
Research and Development, 2(1), 1-21.

Croft W.B. and Das R. (1990). Experiments with Query Acquisition and Use in Document Retrieval Systems. In:
Proceedings of the 13th International Conference on Research and Development in Information Retrieval; Sep-
tember 1990, 349-368; Brussels, Belgium.

Croft W.B., Turtle H., and Lewis D. (1991). The Use of Phrases and Structured Queries in Information
Retrieval. In: Proceedings of the 14th International Conference on Research and Development in Information
Retrieval; October 1991, 32-45; Chicago, Illinois.

Crouch C.J. (1988). A Cluster-Based Approach to Thesaurus Construction. In: Proceedings of the ACM Confer-
ence on Research and Development in Information Retrieval; June 1988, 309-320; Grenoble, France.

Deogun J.S. and Raghavan V.V. (1988). Integration of Information Retrieval and Database Management Sys-
tems. Information Processing and Management, 24(3), 303-313.

Dillon M. and Gray A.S. (1983). FASIT: A fully automatic syntactically based indexing system. Journal of the
American Society for Information Science, 34(2), 99-108.

Fagan J. (1987). Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntac-
tic and Nonsyntactic Methods. Doctoral dissertation, Cornell University, Ithaca, N.Y.

Fagan J. (1989). The Effectiveness of a Nonsyntactic Approach to Automatic Phrase Indexing for Document
Retrieval, Journal of the American Society for Information Science, 40(2), 115-132.

Fox C. (1992). Lexical Analysis and Stoplists. In Frakes W.B. and Baeza-Yates R., (Ed.), Information Retrieval:
Data Structures and Algorithms, Englewood Cliffs, N.J.: Prentice-Hall.

Fox E.A., Nunn G.L., and Lee W.C. (1988). Coefficients of Combining Concept Classes in a Collection. In:
Proceedings of the ACM Conference on Research and Development in Information Retrieval; June 1988, 291-
308; Grenoble, France.

Frakes W.B. (1984). Term Conflation for Information Retrieval. In: Proceedings of the Third Joint BCS and
ACM symposium on Research and Development in Information Retrieval; July 1984, 383-390; Cambridge, Eng-
land.

Frakes W.B. (1992). Stemming Algorithms. In Frakes W.B. and Baeza-Yates R., (Ed.), Information Retrieval:
Data Structures and Algorithms, Englewood Cliffs, N.J.: Prentice-Hall.

Francis W. and Kucera H. (1982). Frequency Analysis of English Usage, New York, N.Y.: Houghton Mifflin.

Harman D. (1986). An Experimental Study of Factors Important in Document Ranking. In: Proceedings of the
ACM Conference on Research and Development in Information Retrieval; September 1986, 186-193; Pisa, Italy.

Harman D. (1991). How Effective is Suffixing? Journal of the American Society for Information Science, 42(1),
7-15.

Harman D. (1992a). Ranking Algorithms. In Frakes W.B. and Baeza-Yates R., (Ed.), Information Retrieval:
Data Structures and Algorithms, Englewood Cliffs, N.J.: Prentice-Hall.

Harman D. (1992b). Relevance Feedback and Other Query Modification Techniques. In Frakes W.B. and
Baeza-Yates R., (Ed.), Information Retrieval: Data Structures and Algorithms, Englewood Cliffs, N.J.: Prentice-
Hall.


Harman D. (1992c). Relevance Feedback Revisited. In: Proceedings of the 15th International Conference on
Research and Development in Information Retrieval; June 1992, 1-10; Copenhagen, Denmark.

Harman D. and Candela G. (1990). Retrieving Records from a Gigabyte of Text on a Minicomputer using Sta-
tistical Ranking, Journal of the American Society for Information Science, 41(8), 581-589.

Katzer J., McGill M., Tessier J.A., Frakes W., and DasGupta P. (1982). A Study of the Overlap among Docu-
ment Representations. Information Technology: Research and Development, 1(2), 261-274.

Lewis D.D. (1992). Feature Selection and Feature Extraction for Text Categorization. Paper to appear in the
Proceedings of the 5th DARPA Workshop on Speech and Natural Language, Harriman, N.Y.

Lewis D.D. and Croft W.B. (1990). Term Clustering of Syntactic Phrases. In: Proceedings of the 13th Interna-
tional Conference on Research and Development in Information Retrieval; September 1990, 385-404; Brussels,
Belgium.

Lennon M., Pierce D., Tarry B., and Willett P. (1981). An Evaluation of Some Conflation Algorithms for Informa-
tion Retrieval, Journal of Information Science, 3, 177-188.

Lovins J.B. (1968). Development of a Stemming Algorithm, Mechanical Translation and Computational
Linguistics, 11, 22-31.

Luhn H.P. (1957). A Statistical Approach to Mechanized Encoding and Searching of Literary Information, IBM
Journal of Research and Development, 1(4), 309-317.

Pacak M.G. and Pratt A.W. (1978). Identification and Transformation of Terminal Morphemes in Medical
English, Part II, Methods of Information in Medicine, 17, 95-100.

Porter M.F. (1980). An Algorithm for Suffix Stripping, Program, 14(3), 130-137.

Salton G. and Buckley C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Information Pro-
cessing and Management, 24(5), 513-523.

Salton G. and Buckley C. (1989). A Comparison between Statistically and Syntactically Generated Term
Phrases. Technical Report TR 89-1027, Cornell University: Computing Science Department.

Salton G. and Buckley C. (1990a). Improving Retrieval Performance by Relevance Feedback. Journal of the
American Society for Information Science, 41(4), 288-297.

Salton G. and Buckley C. (1990d). An Evaluation of Text Matching Systems for Text Excerpts of Varying
Scope. Technical Report TR 89-1027, Cornell University: Computing Science Department.

Salton G. and Buckley C. (1991). Automatic Text Structuring and Retrieval: Experiments in Automatic Encyclo-
pedia Searching. In: Proceedings of the 14th International Conference on Research and Development in Infor-
mation Retrieval; October 1991, 21-31; Chicago, Illinois.

Salton G., Buckley C., and Smith M. (1990b). On the Application of Syntactic Methodologies in Automatic Text
Analysis, Information Processing and Management, 26(1), 73-92.

Salton G., Zhao Z., and Buckley C. (1990c). A Simple Syntactic Approach for the Generation of Indexing
Phrases. Technical Report TR 90-1137, Cornell University: Computing Science Department.

Salton G. and McGill M. (1983). Introduction to Modern Information Retrieval. New York, N.Y.: McGraw-Hill.

Sparck Jones K. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal
of Documentation, 28(1), 11-20.



Sparck Jones K. and Jackson D.M. (1970). The Use of Automatically-Obtained Keyword Classifications for
Information Retrieval. Information Storage and Retrieval, 5, 175-201.

Strzalkowski T. (1992). Information Retrieval using Robust Natural Language. Paper to appear in the Proceed-
ings of the 5th DARPA Workshop on Speech and Natural Language, Harriman, N.Y.

Turtle H. and Croft W.B. (1991). Evaluation of an Inference Network-Based Retrieval Model. ACM Transac-
tions on Information Systems, 9(3), 187-222.

van Rijsbergen C.J. (1975). Information Retrieval. London: Butterworths.

Walker S. and Jones R.M. (1987). Improving Subject Retrieval in Online Catalogues, British Library
Research Paper 24.

Weiss S.F. (1970). A Template Approach to Natural Language Analysis for Information Retrieval. Doctoral
dissertation, Cornell University, Ithaca, N.Y.