IR4873 NIST Interagency Report 4873: Automatic Indexing
Donna Harman, National Institute of Standards and Technology

example the application of searching an online manual might have a record defined as the lowest subsection, so that users find and display very exact subsections of material. If the application is to provide pointers into paper copies of long articles (such as 100-page+ court cases), it might be reasonable to make each page or small section a record so that the display could show a one-line sentence with the hits, and give the page number.

The choice of record size is not only important for display, but is also critical for effective searching. A record that is too short provides little text for the searching algorithms to use and will cause poor results. Too large a record, however, may dilute the importance of word matches and cause many false matches. For these reasons it would not be reasonable to make a sentence a record, but paragraphs might be fine as records. Alternatively, it would not be effective to make a very long section a record; it would be better to break it into smaller subsections. Further, the choice of record size may also affect the choice of term weighting and retrieval algorithms (see section 3.1 on term weighting).

A recent paper (Harman & Candela 1990) shows some possible record size decisions and their consequences. Three different text collections were involved in user testing of a retrieval system using automatic indexing and statistical ranking. The first text collection was small (1.6 megabytes) and consisted of a manual organized into sections and chapters. A record was determined to be equivalent to a paragraph in this manual, because this appeared to be the most useful record size for the end users. This decision caused many short records (see Table 1). The second text collection was a legal code book, with sections and subsections.
Here the records were set to be each subsection, again based on user preference. The records were therefore much larger, with many words occurring multiple times within each record. The third text collection consisted of about 40,000 court cases. A record here was set to be a court case.

Table 1 shows some basic statistics on these text collections. The average number of terms per record includes duplicate terms and is a measure of the record length rather than the number of unique term occurrences. The average number of postings per term is the average number of documents containing that term.

TABLE 1
Collection Statistics

                                      Manual    Legal code    Court cases
Size of collection                    1.6 MB    50 MB         806 MB
Number of records                     2653      6652          38304
Average number of terms per record    96        1124          3264
Number of unique terms                5123      25129         243470
Average postings per term             14        40            88

2.2 What constitutes a word and what "words" to index

The second key decision for any indexing is the choice of what constitutes a word and then which of these words to index. In manual indexing systems this choice is easily made by the human indexer. However, for automatic indexing it is necessary to define what punctuation should be used as word separators and to define what "words" to index.

Normally word separators include all white spaces and all punctuation. However, there are many exceptions to this rule, and, depending on the application and the searching software, the methods of handling these exceptions can be crucial to successful retrieval. The following examples illustrate some of the problems encountered in typical applications.

* Hyphens -- some words can appear in both hyphenated and unhyphenated versions. Sometimes the treatment of hyphens is critical to retrieval, such as in chemical names and other normally hyphenated elements (glycol-sebacic, F-15, MS-DOS, etc.).
* Periods -- periods can appear as a part of a word, such as computer file names (paper.version1), subsection titles (1.367A), and in company names.
* Slashes, parentheses, underscores -- these can appear as parts of words (OS/2), as parts of section titles (367(A)), and as parts of terms in programming languages (doc_no).
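
The effect of these separator decisions can be illustrated with a minimal Python sketch. It is not the method of any system described in this report: the function name `tokenize`, the flag `keep_joined`, and the particular set of joining characters are all assumptions chosen for illustration. The sketch contrasts the default rule (all punctuation separates words) with a rule that keeps hyphens, periods, slashes, underscores, and parenthesized suffixes when they occur inside a word.

```python
import re

# Default rule: every non-alphanumeric character is a word separator.
SPLIT_RE = re.compile(r"[A-Za-z0-9]+")

# Exception rule (an assumption for illustration): an alphanumeric run may be
# extended by
#   - a joiner (- . / _) followed by another alphanumeric run, e.g. MS-DOS,
#     OS/2, paper.version1, doc_no, 1.367A
#   - a parenthesized alphanumeric run, e.g. 367(A)
KEEP_RE = re.compile(r"[A-Za-z0-9]+(?:[-./_][A-Za-z0-9]+|\([A-Za-z0-9]+\))*")


def tokenize(text, keep_joined=True):
    """Return the list of words found in text.

    keep_joined=True  keeps internal punctuation as part of the word.
    keep_joined=False treats all punctuation as a separator.
    """
    pattern = KEEP_RE if keep_joined else SPLIT_RE
    return [match.group(0) for match in pattern.finditer(text)]


sample = "See section 367(A) on MS-DOS, OS/2 and doc_no in paper.version1."

print(tokenize(sample))
# ['See', 'section', '367(A)', 'on', 'MS-DOS', 'OS/2', 'and', 'doc_no',
#  'in', 'paper.version1']

print(tokenize(sample, keep_joined=False))
# ['See', 'section', '367', 'A', 'on', 'MS', 'DOS', 'OS', '2', 'and',
#  'doc', 'no', 'in', 'paper', 'version1']
```

Note that a period or comma at the end of a word is still dropped under the exception rule, because a joiner is only kept when an alphanumeric character follows it. Which rule is appropriate depends on the application and the searching software: under the default rule a query for OS/2 can only match the separate words OS and 2.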