NIST Interagency Report 4873: Automatic Indexing

Donna Harman
National Institute of Standards and Technology

4. There is no allowance for document length. Although this factor is not as important as the first three factors, it can be important to normalize ranking for length because otherwise long documents often rank higher than short documents, even though the query terms may be more concentrated in the short documents.

These problems can be largely avoided by using more complex statistical ranking routines involving proper term weighting or accurate similarity measures. Various laboratory experiments have been concerned with developing optimal methods of weighting the terms and of measuring the similarity of a document and the query. One of the term weighting measures that has proven very successful is the inverted document frequency weight or IDF (Sparck Jones 1972), which is basically a measure of the scarcity of a term in the text collection. A second measure used is some function of a term's frequency within a record. These measures are often combined, with appropriate normalization factors for length, to form a single term weight. Statistically-ranked retrieval using this type of term weighting has a retrieval performance that is significantly better in the laboratory than using no term weighting (Salton & McGill 1983, Croft 1983, Harman 1986).

The following recommendations can be made based on this research.

1. The use of term weighting based on the distribution of a term within a collection usually improves performance, and never hurts performance. The IDF measure has been commonly used for this weighting.

$$\mathrm{IDF}_i = \log_2\!\left(\frac{N}{n_i}\right) + 1 \quad \text{(Sparck Jones 1972)}$$

where
$N$ = the number of documents in the collection
$n_i$ = the total frequency of term $i$ in the collection

2. The combination of the within-document frequency with the IDF weight often provides even more improvement.
It is important to normalize the within-document frequency in some manner, both to moderate the effect of high frequency terms in a document (i.e. a term appearing 20 times is not 20 times as important as one appearing only once) and to compensate for document length. Data containing very short documents (such as titles only) should not use weighting for within-document frequency. The following within-document frequency measures illustrate correct normalization procedures.

$$\mathrm{cfreq}_{ij} = K + (1-K)\,\frac{\mathrm{freq}_{ij}}{\mathrm{maxfreq}_j} \quad \text{(Croft 1983)}$$

$$\mathrm{nfreq}_{ij} = \frac{\log_2(\mathrm{freq}_{ij}+1)}{\log_2 \mathrm{length}_j} \quad \text{(Harman 1986)}$$

where
$\mathrm{freq}_{ij}$ = the frequency of term $i$ in document $j$
$\mathrm{maxfreq}_j$ = the maximum frequency of any term in document $j$
$K$ = the constant used to adjust for relative importance of within-document frequency
$\mathrm{length}_j$ = the number of unique terms in document $j$

3. Assuming within-document term frequencies are to be used, several methods can be used for combining these with the IDF measure. Both the combining of term weighting and the use of this weighting in similarity measures between queries and documents are shown.
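
As a rough illustration of recommendations 1 and 2, the IDF weight and the two normalized within-document frequency measures can be sketched in Python. This is only a minimal sketch, not code from the report: the function and parameter names are hypothetical, the input checks are an added assumption, and the value of K used in the usage note is chosen purely for illustration.

```python
import math


def idf(num_docs: int, term_count: int) -> float:
    """IDF_i = log2(N / n_i) + 1, with N and n_i as defined in the text."""
    if num_docs <= 0 or term_count <= 0:
        raise ValueError("num_docs and term_count must be positive")
    return math.log2(num_docs / term_count) + 1


def cfreq(freq: int, max_freq: int, k: float) -> float:
    """Croft (1983) measure: K + (1 - K) * freq_ij / maxfreq_j."""
    if max_freq <= 0:
        raise ValueError("max_freq must be positive")
    return k + (1 - k) * freq / max_freq


def nfreq(freq: int, length: int) -> float:
    """Harman (1986) measure: log2(freq_ij + 1) / log2(length_j)."""
    # Assumption: require at least two unique terms so log2(length) > 0.
    if length < 2:
        raise ValueError("length must be at least 2")
    return math.log2(freq + 1) / math.log2(length)
```

For example, `idf(1024, 4)` gives 9.0, `cfreq(5, 10, 0.5)` gives 0.75, and `nfreq(3, 16)` gives 0.5. Under the logarithmic measure, a term appearing 20 times in a document scores about 4.4 times as high as a term appearing once, not 20 times as high, which is the moderating effect described in recommendation 2.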