NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan
National Institute of Standards and Technology
Donna K. Harman

Automatic Indexing

In the Smart context, the vector-processing model of retrieval is used to transform both the available information requests and the stored documents into vectors of the form

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$

where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ attached to document $D_i$. A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

In choosing a term weighting system, low weights should be assigned to high-frequency terms that occur in many documents of a collection, and high weights to terms that are important in particular documents but unimportant in the remainder of the collection. The weight of terms that occur rarely in a collection is relatively unimportant, because such terms contribute little to the needed similarity computation between different texts.

A well-known term weighting system following that prescription assigns weight $w_{ik}$ to term $T_k$ in document $D_i$ in proportion to the frequency of occurrence of the term in $D_i$, and in inverse proportion to the number of documents to which the term is assigned. [6,9] Such a weighting system is known as a tf × idf (term frequency times inverse document frequency) weighting system.
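The vector form and the tf × idf idea can be illustrated with a minimal sketch. It assumes that the inverse document frequency factor is log(N/n_k), which is one common choice; the function name `tf_idf_vectors` and the toy documents are hypothetical and are not part of the Smart system.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn token lists into dense tf x idf vectors over a shared vocabulary.

    Assumption: idf is log(N / n_k), with N the number of documents and
    n_k the number of documents containing term k.
    """
    n_docs = len(docs)
    # n_k: count each term once per document in which it occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in docs:
        freq = Counter(doc)  # f_ik; terms absent from the document count as 0
        vectors.append(
            [freq[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        )
    return vocabulary, vectors


# Toy collection (hypothetical data)
docs = [
    ["retrieval", "vector", "vector"],
    ["retrieval", "weight"],
    ["retrieval", "index", "weight"],
]
vocabulary, vectors = tf_idf_vectors(docs)
```

In this toy collection "retrieval" occurs in every document, so its weight is zero everywhere, while "vector", which is frequent in the first document and absent elsewhere, receives the largest weight there. Terms absent from a document keep a weight of zero, as described above.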
In practice the document length, and hence the number of non-zero term weights assigned to a document, varies widely. To give each text item an equal chance of being retrieved, it is convenient to use a length normalization factor as part of the term weighting formula. A high-quality term weighting formula for $w_{ik}$, the weight of term $T_k$ in document $D_i$, is

$$w_{ik} = \frac{f_{ik} \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t} \left( f_{ij} \cdot \log(N/n_j) \right)^2}} \qquad (1)$$

where $f_{ik}$ is the occurrence frequency of $T_k$ in $D_i$, $N$ is the collection size, and $n_k$ is the number of documents with term $T_k$ assigned. The factor $\log(N/n_k)$ is an inverse collection frequency factor which decreases as terms are used widely in a collection, and the denominator in expression (1) is used for weight normalization.

The terms $T_k$ included in a given vector can in principle represent any entities assigned to a document for content identification. In the Smart context, such terms are derived by a text transformation of the following kind: [2]

1. recognize individual text words
2. use a stop list to eliminate unwanted function words
3. perform suffix removal to generate word stems
4. optionally use term grouping methods based on a statistical word co-occurrence, or word adjacency, computation to form term phrases (alternatively, syntactic analysis computations can also be used)
5. assign term weights to all remaining word stems and/or phrase stems to form the term vector for all information items.
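The transformation steps and the length-normalized weights of expression (1) can be sketched as follows. This is only an illustration under stated assumptions: the stop list `STOP_WORDS` and the suffix list `SUFFIXES` are toy stand-ins invented here, not the stop list or stemmer used by Smart, and the optional phrase step (step 4) is omitted.

```python
import math
import re
from collections import Counter

# Toy stop list and suffix list (assumptions for illustration only)
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    words = re.findall(r"[a-z]+", text.lower())         # step 1: recognize words
    words = [w for w in words if w not in STOP_WORDS]   # step 2: apply stop list
    return [stem(w) for w in words]                     # step 3: suffix removal


def weight_vectors(texts):
    """Step 5: weights following expression (1), stored as sparse dicts."""
    docs = [Counter(index_terms(t)) for t in texts]     # f_ik per document
    n_docs = len(docs)                                  # N
    doc_freq = Counter(term for doc in docs for term in doc)  # n_k
    vectors = []
    for doc in docs:
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))  # denominator of (1)
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical example texts
texts = [
    "Indexing the documents",
    "Retrieving documents and queries",
    "Weighting of the terms",
]
vectors = weight_vectors(texts)
```

Because of the normalization, every resulting vector with at least one non-zero weight has unit Euclidean length, so long and short texts are placed on an equal footing when similarities are computed.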