SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
Automatic Indexing
In the Smart context, the vector-processing model of retrieval is used to transform both the available
information requests as well as the stored documents into vector form of the type

$$ D_i = (w_{i1}, w_{i2}, \ldots, w_{it}) $$

where $D_i$ represents a document (or query) text and $w_{ik}$ is a term weight of term $T_k$ attached to
document $D_i$. A weight of zero is used for terms that are absent from a particular document, and
positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are
available for the representation of the information.
In choosing a term weighting system, low weights should be assigned to high-frequency terms
that occur in many documents of a collection, and high weights to terms that are important in
particular documents but unimportant in the remainder of the collection. The weight of terms that
occur rarely in a collection is relatively unimportant, because such terms contribute little to the
needed similarity computation between different texts.
A well-known term weighting system following that prescription assigns weights $w_{ik}$ to term
$T_k$ in document $D_i$ in proportion to the frequency of occurrence of a term in $D_i$, and in inverse
proportion to the number of documents to which the term is assigned. [6,9] Such a weighting system
is known as a tf * idf (term frequency times inverse document frequency) weighting system. In
practice the document length, and hence the number of non-zero term weights assigned to a
document, varies widely. To give each text item an equal chance of being retrieved, it is convenient
to use a length normalization factor as part of the term weighting formula. A high-quality term
weighting formula for $w_{ik}$, the weight of term $T_k$ in document $D_i$, is

$$ w_{ik} = \frac{f_{ik} \cdot \log(N/n_k)}{\sqrt{\sum_{k=1}^{t} \left( f_{ik} \cdot \log(N/n_k) \right)^2}} \qquad (1) $$

where $f_{ik}$ is the occurrence frequency of $T_k$ in $D_i$, $N$ is the collection size and $n_k$ the number of
documents with term $T_k$ assigned. The factor $\log(N/n_k)$ is an inverse collection frequency factor
which decreases as terms are used widely in a collection, and the denominator in expression (1) is
used for weight normalization.
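
As an illustration of expression (1), the following is a minimal Python sketch, not the Smart implementation. The function name `tfidf_weights`, the representation of a document as a list of already-extracted terms, and the three-document toy collection are all assumptions made for this example. A document whose terms all occur in every document would have a zero denominator; the sketch leaves its weights at zero, which is a choice made here and is not specified in the text.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, following expression (1).

    `docs` is a list of documents, each a list of terms (hypothetical input format).
    """
    n_docs = len(docs)          # N: collection size
    doc_freq = Counter()        # n_k: number of documents containing term k
    for terms in docs:
        doc_freq.update(set(terms))

    weighted = []
    for terms in docs:
        freqs = Counter(terms)  # f_ik: occurrences of term k in document i
        # Numerator of (1): term frequency times inverse collection frequency.
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in freqs.items()}
        # Denominator of (1): Euclidean length of the unnormalized vector.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # Assumption: if every raw weight is zero, keep the weights at zero.
        weighted.append({t: (v / norm if norm > 0 else 0.0) for t, v in raw.items()})
    return weighted


# Toy collection (invented for illustration only).
docs = [
    ["retrieval", "vector", "vector"],
    ["retrieval", "weight"],
    ["retrieval", "phrase", "weight"],
]
weights = tfidf_weights(docs)
```

In this toy collection "retrieval" occurs in every document, so its factor log(N/n_k) is log(1) = 0 and it receives zero weight everywhere, while each non-empty weight vector has unit length after normalization.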
The terms $T_k$ included in a given vector can in principle represent any entities assigned to
a document for content identification. In the Smart context, such terms are derived by a text
transformation of the following kind: [2]
1. recognize individual text words
2. use stop list to eliminate unwanted function words
3. perform suffix removal to generate word stems
4. optionally use term grouping methods based on a statistical word co-occurrence, or word
adjacency, computation to form term phrases (alternatively syntactic analysis computations
can also be used)
5. assign term weights to all remaining word stems and/or phrase stems to form the term vector
for all information items.
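
The five steps above can be sketched roughly as follows. This is a toy Python illustration under stated assumptions, not the Smart code: the stop list `STOP_WORDS`, the suffix list `SUFFIXES`, and the functions `tokenize`, `strip_suffix`, and `index_terms` are all hypothetical. Suffix removal is a crude longest-match strip, and the optional phrase step simply pairs adjacent stems after stop-word removal, standing in for the statistical co-occurrence or adjacency computations mentioned in step 4. The function returns raw term frequencies, which are the input to the weighting of step 5.

```python
import re
from collections import Counter

# Toy stop list and suffix list (assumptions for illustration only).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Step 1: recognize individual text words (lowercased alphabetic runs)."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Step 3: remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text, with_phrases=False):
    """Return raw term frequencies for one text; weighting (step 5) is applied later."""
    words = tokenize(text)                               # step 1
    content = [w for w in words if w not in STOP_WORDS]  # step 2
    stems = [strip_suffix(w) for w in content]           # step 3
    terms = list(stems)
    if with_phrases:                                     # step 4 (adjacent stem pairs)
        terms += [f"{a}_{b}" for a, b in zip(stems, stems[1:])]
    return Counter(terms)


counts = index_terms(
    "Indexing the stored documents and weighting the terms", with_phrases=True
)
```

Here `counts` holds single stems such as "index" and "document" along with adjacency phrases such as "weight_term"; a production system would use a proper stemmer and a much larger stop list.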