SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan
National Institute of Standards and Technology, Donna K. Harman

The weighting scheme used in step 1.5 determines whether the entire indexing approach can be done in one pass or requires two. The standard SMART weight, which we recommend when nothing is known about the collection, is a straight tf * idf cosine normalized weight. Unfortunately the "idf" value cannot be computed without knowing the document frequency of the term. Thus an accurate idf requires a two pass algorithm, the first finding the collection frequency of all terms and the second actually assigning the tf*idf weight. Alternative one pass weighting schemes are discussed at the end of the tradeoff discussion. Until then, all indexing runs discussed will be two pass runs indexing the documents with this tf*idf weight.

Intermediate Document Vectors

A two pass approach that uses minimal space involves doing steps 1.1 through 1.4 above on pass one but, instead of weighting and storing the actual vectors, just keeping track of the collection frequency of each term.
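The two pass scheme described above can be sketched in Python as follows. This is a minimal illustration and not SMART's implementation: the function name `two_pass_tfidf`, the use of raw term counts for tf, and the idf form log(N / df) are assumptions made for the example, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def two_pass_tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    # Pass 1: for each term, count the number of documents that contain it.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    # Pass 2: re-read each document, weight every term by tf * idf
    # (assumed idf = log(N / df)), then divide by the vector's Euclidean
    # length so that each document vector has unit length (cosine normalization).
    for tokens in docs:
        tf = Counter(tokens)
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```

Note that the document frequencies are only complete once the first loop has seen every document, which is why the weights cannot be assigned during that loop.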
Then pass 2 repeats steps 1.1 through 1.4, but can then go ahead and compute the weight in 1.5 and store the vector. RUN 1 in Table 1 gives the timing figures for this approach: 4.5 hours for pass 1 and 4.9 hours for pass 2 (pass 2 takes longer because it needs to construct the inverted index).

An alternative to this approach is to store tf weighted document vectors (as document vectors, not in inverted index form) for pass 1. Then pass 2 can ignore steps 1.1 through 1.4 and go directly to the weighting and inverted index construction. As RUN 2 shows, this is much quicker (4.7 hours for pass 1 and 0.7 hours for pass 2), but at a cost of doubling the amount of disk space needed. Obviously the choice between these two approaches depends on whether indexing time or disk space is more important to the database administrator.

Stopwords

No retrieval system wants to store inverted indices for all words in the text (at least for retrieval purposes). Words like "the", "of", and "a" are not useful for distinguishing relevant documents and take up an extremely large amount of space, since they occur in nearly every document. The question is how many stopwords to ignore. SMART has a standard, collection-independent list of 571 words that seem to convey little information about relevance.
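One way to explore how many stopwords to ignore is to extend a base stoplist with every term that occurs in more than a chosen fraction of the documents. The Python sketch below is illustrative only: the function names `extend_stoplist` and `remove_stopwords` are hypothetical, documents are assumed to be already tokenized, and the base list passed in is whatever the caller supplies (it is not SMART's 571-word list).

```python
from collections import Counter


def extend_stoplist(docs, base_stoplist, min_doc_fraction):
    """Return base_stoplist plus every term that occurs in more than
    min_doc_fraction of the documents (docs: list of token lists)."""
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document.
        doc_freq.update(set(tokens))
    cutoff = min_doc_fraction * len(docs)
    frequent = {t for t, df in doc_freq.items() if df > cutoff}
    return set(base_stoplist) | frequent


def remove_stopwords(tokens, stoplist):
    """Drop every token that appears in the stoplist, keeping order."""
    return [t for t in tokens if t not in stoplist]
```

Lowering `min_doc_fraction` removes more terms from the index, which is the space and time versus effectiveness tradeoff at issue here.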
But individual collections often contain an additional number of words that give little information for the particular subject matter covered by that collection. Three runs (RUNS 3-5) were made on TREC, adding the most frequently occurring words in TREC to the standard SMART stopword list. RUN 3 added the 69 terms occurring in more than 10% of the collection; RUN 4 added 350 terms occurring in more than 5% of the collection; and RUN 5 added the 1266 terms occurring in more than 2% of the collection.

The space savings are substantial, ranging from 7% to 43% of the inverted file size, with a corresponding savings in indexing time. Retrieval time is even more affected, as many of the very long inverted lists for common words no longer have to be dealt with: RUN 4 saves 54% and RUN 5 79%.

The penalty that needs to be paid for these savings is in retrieval effectiveness. There is no penalty for RUN 3, and the reduced effectiveness in RUN 4 is insignificant, but RUN 5 loses about 15%. Unless you need maximal effectiveness, RUN 4 would seem to be worthwhile in practice.

One other potential problem with removing the most common words of the collection is user mystification. Users can understand that words like "the" don't help retrieval, but may be surprised when sentences like

    The head and president of an American computer system company based in Washington said she expected to make a million systems by the end of the year.