IR4873
NIST Interagency Report 4873: Automatic Indexing
Automatic Indexing
chapter
Donna Harman
National Institute of Standards and Technology
$$
\mathrm{similarity}(Q, D) = \frac{\sum_{i=1}^{t} (w_{iq} \times w_{ij})}{\sqrt{\sum_{i=1}^{t} (w_{iq})^2 \times \sum_{i=1}^{t} (w_{ij})^2}} \qquad \text{(Salton \& Buckley 1988)}
$$
where
$$
w_{iq} = \left(0.5 + \frac{0.5\,\mathrm{freq}_{iq}}{\mathrm{maxfreq}_q}\right) \times \mathrm{IDF}_i
$$
and
$$
w_{ij} = \frac{\mathrm{freq}_{ij} \times \mathrm{IDF}_i}{\sqrt{\sum_{i=1}^{t} (\mathrm{freq}_{ij} \times \mathrm{IDF}_i)^2}}
$$
where $\mathrm{freq}_{iq}$ = the frequency of term $i$ in query $q$
$\mathrm{maxfreq}_q$ = the maximum frequency of any term in query $q$
$\mathrm{IDF}_i$ = the IDF of term $i$ in the entire collection
$\mathrm{freq}_{ij}$ = the frequency of term $i$ in document $j$
Salton & Buckley suggest reducing the query weighting $w_{iq}$ to only the within-document frequency
($\mathrm{freq}_{iq}$) for long queries containing multiple occurrences of terms, and to use only binary weighting of
documents ($w_{ij}$ = 1 or 0) for collections with short documents or collections using controlled vocabulary.
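The Salton & Buckley weighting can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the IDF values are assumed to be precomputed and passed in as a dictionary (this section does not fix an IDF formula), and the document weights are assumed to be length-normalized.

```python
import math
from collections import Counter


def query_weights(query_terms, idf):
    """w_iq = (0.5 + 0.5 * freq_iq / maxfreq_q) * IDF_i for each query term."""
    freq = Counter(query_terms)
    maxfreq = max(freq.values())
    return {t: (0.5 + 0.5 * f / maxfreq) * idf.get(t, 0.0) for t, f in freq.items()}


def document_weights(doc_terms, idf):
    """w_ij = freq_ij * IDF_i, divided by the length of the document vector."""
    freq = Counter(doc_terms)
    raw = {t: f * idf.get(t, 0.0) for t, f in freq.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def cosine_similarity(wq, wd):
    """Inner product of the two weight vectors over the product of their lengths."""
    dot = sum(w * wd.get(t, 0.0) for t, w in wq.items())
    nq = math.sqrt(sum(w * w for w in wq.values()))
    nd = math.sqrt(sum(w * w for w in wd.values()))
    return dot / (nq * nd) if nq and nd else 0.0


# Illustrative IDF values only; a real system derives them from the collection.
example_idf = {"index": 1.0, "term": 2.0, "weight": 3.0}
wq = query_weights(["term", "weight", "term"], example_idf)
wd = document_weights(["term", "weight", "weight", "index"], example_idf)
print(cosine_similarity(wq, wd))
```

The binary document weighting suggested for short documents would amount to replacing the values returned by `document_weights` with 1.0 for every term present.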
$$
\mathrm{similarity}_{jq} = \sum_{i=1}^{Q} (C + \mathrm{IDF}_i \times \mathrm{cfreq}_{ij}) \qquad \text{(Croft 1983)}
$$
where
$$
\mathrm{cfreq}_{ij} = K + (1 - K)\,\frac{\mathrm{freq}_{ij}}{\mathrm{maxfreq}_j}
$$
where $\mathrm{freq}_{ij}$ = the frequency of term $i$ in document $j$
$C$ = the constant used to adjust for relative importance of all term weighting
$\mathrm{maxfreq}_j$ = the maximum frequency of any term in document $j$
$K$ = the constant used to adjust for relative importance of within-document frequency
C should be set to low values (near 0) for automatically indexed collections, and to higher values such as 1
for manually-indexed collections. K should be set to low values (0.3 was used by Croft) for collections with
long (35 or more terms) documents, and to higher values (0.5 or higher) for collections with short documents,
reducing the role of within-document frequency.
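A small Python sketch may make the roles of C and K in the Croft (1983) formula concrete. The function name is hypothetical, the IDF values are assumed to be supplied by the caller, and the sum is assumed to run only over query terms that actually occur in the document.

```python
from collections import Counter


def croft_similarity(query_terms, doc_terms, idf, C=0.0, K=0.3):
    """Sum of C + IDF_i * cfreq_ij over query terms found in the document,
    with cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j."""
    freq = Counter(doc_terms)
    if not freq:
        return 0.0
    maxfreq = max(freq.values())
    score = 0.0
    for term in set(query_terms):
        f = freq.get(term, 0)
        if f == 0:
            continue  # assumption: non-matching terms contribute nothing
        cfreq = K + (1 - K) * f / maxfreq
        score += C + idf.get(term, 0.0) * cfreq
    return score


# Illustrative IDF values only.
example_idf = {"a": 2.0, "b": 4.0}
doc = ["a", "a", "b"]
print(croft_similarity(["a", "b"], doc, example_idf, C=0.0, K=0.3))
print(croft_similarity(["a", "b"], doc, example_idf, C=0.0, K=1.0))
```

With K = 1 the within-document frequency drops out entirely and every matching term contributes C + IDF, which is the limiting case of the advice to raise K for short documents.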
$$
\mathrm{similarity}_{jq} = \sum_{i=1}^{Q} \frac{\log_2(\mathrm{freq}_{ij} + 1) \times \mathrm{IDF}_i}{\log_2 \mathrm{length}_j} \qquad \text{(Harman 1986)}
$$
where $\mathrm{freq}_{ij}$ = the frequency of term $i$ in document $j$
$\mathrm{length}_j$ = the number of unique terms in document $j$
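The Harman (1986) formula can be sketched in the same way. Again the function name is hypothetical and the IDF values are assumed to be given. The guard for documents with fewer than two unique terms is an added assumption, since log2 of 1 is zero and the formula is undefined there.

```python
import math
from collections import Counter


def harman_similarity(query_terms, doc_terms, idf):
    """Sum over query terms of log2(freq_ij + 1) * IDF_i, divided by
    log2 of the number of unique terms in the document."""
    freq = Counter(doc_terms)
    length = len(freq)  # number of unique terms in the document
    if length < 2:
        return 0.0  # assumption: avoid dividing by log2(1) = 0
    total = sum(
        math.log2(freq[term] + 1) * idf.get(term, 0.0)
        for term in set(query_terms)
    )
    return total / math.log2(length)


# Illustrative IDF values only.
example_idf = {"a": 1.5, "b": 2.0}
print(harman_similarity(["a", "b"], ["a", "a", "a", "b", "c", "d"], example_idf))
```

Query terms absent from the document have a frequency of zero, so their logarithm term is zero and they add nothing to the score.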
4. It can be very useful to add additional weight for document structure, such as higher weightings for terms
appearing in the title or abstract versus those appearing only in the text. This additional weighting needs to
be considered with respect to the particular text collection being used for searching.
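One simple way to realize such structural weighting is to scale each term occurrence by a multiplier for the field it appears in before computing frequencies. The sketch below assumes a document represented as a mapping from field name to term list; the field names and multiplier values are hypothetical and would need tuning for a particular collection.

```python
from collections import Counter

# Hypothetical field multipliers, chosen only for illustration.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "text": 1.0}


def structured_frequencies(fields, field_weights=FIELD_WEIGHTS):
    """Return term frequencies in which each occurrence counts as the
    multiplier of the field it appears in (1.0 for unlisted fields)."""
    weighted = Counter()
    for field, terms in fields.items():
        boost = field_weights.get(field, 1.0)
        for term in terms:
            weighted[term] += boost
    return weighted


doc = {"title": ["index"], "text": ["index", "term"]}
print(structured_frequencies(doc))
```

The resulting weighted counts could stand in for the raw within-document frequencies in any of the term-weighting formulas.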
This section on term weighting presents only a few of the experimental techniques that have been tried. For a
more thorough survey, see Harman (199[OCRerr]).