SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
The weighting scheme used will determine whether the entire indexing can be done
in one pass or requires two. The weight which we recommend when nothing is
known about the collection is a straight tf * idf cosine normalized weight. Unfortunately
the "idf" value cannot be computed without knowing the document frequency of the term. Thus
an accurate idf requires a two pass algorithm, the first finding the collection frequency of all terms
and the second actually assigning the tf * idf weight. Alternative one pass weighting schemes are
discussed at the end of the tradeoff discussion. Until then all indexing runs discussed will be two
pass runs indexing the documents with tf * idf weights.
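
The two pass weighting described above can be sketched as follows. This is a minimal illustration, not SMART's implementation: the function name two_pass_tfidf, the in-memory token lists, and the choice of idf = log(N / df) are all assumptions made for the example.

```python
import math
from collections import Counter

def two_pass_tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    # Pass 1: count, for each term, the number of documents that contain it.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n_docs = len(docs)

    # Pass 2: compute tf * idf (idf assumed to be log(N / df)),
    # then cosine-normalize each document vector.
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "sat"]]
vectors = two_pass_tfidf(docs)
```

The idf factor is the reason a second pass is needed: doc_freq is complete only after every document has been seen once.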
Intermediate Document Vectors.
A two pass approach that uses minimal space involves doing steps 1.1 through 1.4 above on pass one
but instead of weighting and storing the actual vectors, just keep track of the collection frequency
of each term. Then pass 2 repeats steps 1-4 but then can go ahead and compute the weight in
1.5 and store the vector. RUN 1 in Table 1 gives the timing figures for this approach: 4.5 hours
for pass 1 and 4.9 hours for pass 2 (pass 2 takes longer because it needs to construct the inverted
index).
An alternative to this approach is to store unweighted document vectors (as document vectors,
not in inverted index form) for pass 1. Then pass 2 can ignore steps 1.1-1.4 and go directly to
the weighting and inverted index construction. As RUN 2 shows, this is much quicker (4.7 hours
for pass 1 and 0.7 hours for pass 2) but at a cost of doubling the amount of disk space needed.
Obviously the choice between these two approaches depends on whether indexing time or disk space
is more important to the database administrator.
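
As a rough sketch of the second approach, pass 2 can read the stored unweighted vectors and go straight to weighting and inverted index construction. The data layout ({doc_id: {term: tf}}), the function name weight_and_invert, and the idf formula log(N / df) are assumptions for illustration only.

```python
import math
from collections import defaultdict

def weight_and_invert(tf_vectors, doc_freq):
    """tf_vectors: {doc_id: {term: tf}} saved by pass 1; doc_freq: {term: df}."""
    n_docs = len(tf_vectors)
    inverted = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, tf in sorted(tf_vectors.items()):
        # Weight the stored term frequencies; no re-parsing of the text is needed.
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        # Fall back to 1.0 when every weight is zero, to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        for term, w in raw.items():
            inverted[term].append((doc_id, w / norm))
    return dict(inverted)

tf_vectors = {1: {"cat": 2, "mat": 1}, 2: {"cat": 1, "dog": 3}}
doc_freq = {"cat": 2, "mat": 1, "dog": 1}
index = weight_and_invert(tf_vectors, doc_freq)
```

The stored tf_vectors are the extra disk cost of this variant; in exchange, pass 2 skips tokenizing the documents a second time.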
Stopwords
No retrieval system wants to store inverted indices for all words in the text (at least for retrieval
purposes). Words like "the", "of", and "a" are not useful for distinguishing relevant documents
and take up an extremely large amount of space since they occur in nearly every document. The
question is how many stopwords to ignore. SMART has a standard collection independent list of
571 words that seem to convey little information about relevance. But individual collections often
contain an additional number of words that give little information for the particular subject matter
covered by that collection.
Three runs, RUNS 3-5, were made on TREC, adding the most frequently occurring words
in TREC to the standard SMART stopword list. RUN 3 added the 69 terms occurring in more
than 10% of the collection; RUN 4 added 350 terms occurring in more than 5% of the collection;
and RUN 5 added the 12(56 terms occurring in more than 2% of the collection. The space savings
are substantial, ranging from 7% to 13% of the inverted file size, with a corresponding savings in
indexing time. Retrieval time is even more affected, as many of the very long inverted lists for
common words no longer have to be dealt with. RUN 4 saves 54% and RUN 5 79%.
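
Extending a stopword list with collection-specific frequent terms, as in RUNS 3-5, might look like the sketch below. The helper frequent_terms, the toy documents, and the three-word base list are hypothetical stand-ins; the real SMART list has 571 entries and is not reproduced here.

```python
from collections import Counter

def frequent_terms(docs, min_fraction):
    """Return the terms that occur in more than min_fraction of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    cutoff = min_fraction * len(docs)
    return {t for t, df in doc_freq.items() if df > cutoff}

base_stopwords = {"the", "of", "a"}  # stand-in for the standard list
docs = [
    ["the", "market", "rose"],
    ["market", "fell"],
    ["the", "court", "ruled"],
    ["market", "news"],
]
extended = base_stopwords | frequent_terms(docs, 0.5)
```

Lowering min_fraction (10%, 5%, 2% in the runs above) adds more terms to the list, which shrinks the inverted file further at a growing risk to effectiveness.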
The penalty that needs to be paid for these savings is the retrieval effectiveness. There is no
penalty for RUN 3, and the reduced effectiveness in RUN 4 is insignificant, but RUN 5 loses about
15%. Unless you need maximal effectiveness, RUN 4 would seem to be worthwhile in practice.
One other potential problem with removing the most common words of the collection is user
mystification. Users can understand that words like "the" don't help retrieval but may be surprised
when sentences like
The head and president of an American computer system company based in Washington
said she expected to make a million systems by the end of the year.