NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
contain no indexing words at all. All words are among the standard SMART stopwords or occur in
more than 10% of the TREC documents.
Stemming
Stemming is one of those areas where the tradeoffs can be somewhat subtle [2, 3]. The standard
SMART approach uses full stemming, where most suffixes are removed. RUN 6 removes only
plurals and RUN 7 does no stemming at all. An oft-cited drawback of not doing full stemming is
the increase in the dictionary size; however, that is reasonably insignificant. Far more important is
the increase in inverted file size due to multiple forms of the same word occurring in a document.
Plural indexing increased the inverted file by [illegible] and using no stems increased it by 12.7%.
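The effect on dictionary and inverted file size can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two sample documents, the naive plural rule, and the crude suffix stripper, which is not the SMART stemmer. The counts it produces bear no relation to the figures reported above.

```python
def no_stem(word):
    """No stemming: every surface form is its own index term."""
    return word


def plural_stem(word):
    """Naive plural removal (hypothetical rule, for illustration only)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def crude_full_stem(word):
    """Crude suffix stripping standing in for a full stemmer."""
    word = plural_stem(word)
    for suffix in ("ing", "ed", "er"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_sizes(docs, stem):
    """Return (dictionary size, number of inverted file postings)."""
    # One posting per distinct (term, document) pair.
    postings = {
        (stem(token), doc_id)
        for doc_id, text in enumerate(docs)
        for token in text.lower().split()
    }
    dictionary = {term for term, _ in postings}
    return len(dictionary), len(postings)


DOCS = [
    "weight weights weighted weighting",
    "document documents index indexed indexing",
]

for name, stem in [("none", no_stem), ("plural", plural_stem), ("full", crude_full_stem)]:
    print(name, index_sizes(DOCS, stem))
```

When several forms of one word occur in the same document, each unstemmed form adds its own posting, which is why the inverted file grows faster than the dictionary alone would suggest.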
Indexing speed is given as an advantage of not doing full stemming, but again that's reasonably
insignificant. If full stemming is efficient, then the cost is almost completely counter-balanced by the
cost of creating a larger inverted index. Retrieval speed is not normally mentioned as a disadvantage
of full stemming, but seems to be a considerable factor. RUN 6 (plurals) is 30% faster than RUN [illegible]
(full).
Retrieval effectiveness is often given as an advantage of full stemming over just plural removal,
but the results here agree with other recent results: the differences between the two are insignificant.
No stemming at all is noticeably worse, but not by an extraordinary amount ([illegible]).
Local/global
The basic local/global algorithm is described in a previous section of this report. Our current
implementation is designed to increase flexibility at the cost of retrieval time: for every query, we
had to go out and index and weight [illegible] documents from scratch. That means at retrieval time, we
can do any sort of indexing, weighting, and restrictions we like. In an operational system, however,
it's expected that the local restriction operation would be determined in advance, and would use
preindexed sentence vectors. Thus indexing time and space would increase, but retrieval speed
would go to a reasonable level (it currently takes [illegible] times as long for retrieval with local/global
matching).
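As a rough illustration of indexing sentences at retrieval time, the sketch below re-indexes each globally retrieved candidate document sentence by sentence and keeps it only if some sentence matches the query closely enough. The restriction actually used is described elsewhere in the report and is not reproduced here; the cosine test, the `local_threshold` value, the raw term-frequency weighting, and all function names are assumptions made for this example.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercased alphanumeric tokens (no stemming or stopword removal)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_vectors(text):
    """Split a document into sentences and build one term vector per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [Counter(tokens(s)) for s in sentences]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_restriction(query, candidates, local_threshold=0.5):
    """Keep candidates whose best sentence reaches the (assumed) threshold."""
    q = Counter(tokens(query))
    kept = []
    for doc_id, text in candidates:
        best = max((cosine(q, v) for v in sentence_vectors(text)), default=0.0)
        if best >= local_threshold:
            kept.append((doc_id, best))
    return kept


CANDIDATES = [
    ("a", "Stemming removes suffixes. The weather was fine."),
    ("b", "Stemming is one topic among many many others here. "
          "Suffixes are a different topic altogether in this text."),
]

print(local_restriction("stemming suffixes", CANDIDATES))
```

Both sample documents contain both query terms, so a purely global match would retrieve each of them; only document "a" has the terms together in one sentence. Building the sentence vectors inside the query loop is what makes this arrangement flexible but slow, and precomputing them at indexing time is the tradeoff the paragraph above describes.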
RUN 8 gives the timing and effectiveness figures for the first Cornell official run. Effectiveness for
that run was disappointing: about the same as RUN 1 without local matching. However, between
the time we submitted our first official run and the time of our second official run, we were able to
get a better local restriction method working. Using the restriction method of the second official
run (described in the main portion of the writeup), but on the single term collection (our second
official run used phrases), we get a 10% improvement (RUN 9).
Query Optimization
In the stopword section above, the tradeoff between retrieval speed and retrieval effectiveness was
examined by completely removing long inverted file stopword lists from the collection. This tradeoff
can be examined directly at retrieval time by considering schemes to avoid looking at the longest
inverted lists for query terms unless forced to.
The basic method used here (described in more detail in [1]) is to sort the query by decreasing
query weight (thus hopefully putting query terms with long lists and therefore low idf at the end),
go through the query term-by-term, and stop when it is guaranteed that a certain number of "good"
documents will appear in the final list of 200 top documents. Here, "good" means retrieved in the
top 200 if all query terms are used.
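A minimal sketch of this kind of early termination follows. The exact stopping test is given in [1] and is not reproduced here; the conservative bound used below (each unprocessed term can add at most its query weight times the largest document weight in its inverted list) is an assumption, as are the function names, the toy index, and the requirement that all weights be non-negative.

```python
from bisect import bisect_left
from collections import defaultdict


def guaranteed_count(acc, remaining, k):
    """Count documents certain to finish in the top k.

    acc maps doc_id to partial score; remaining is an upper bound on the
    score any document can still gain from unprocessed query terms.
    """
    scores = sorted(acc.values())
    count = 0
    for s in reversed(scores[-k:]):
        if remaining > 0 and s <= remaining:
            break  # a document not seen so far could still tie or pass it
        # Other seen documents whose upper bound reaches this score.
        rivals = len(scores) - bisect_left(scores, s - remaining) - 1
        if rivals < k:
            count += 1
    return count


def optimized_retrieval(query, index, k=200, n_good=200):
    """Process query terms by decreasing weight, stopping early when safe.

    query maps term to weight; index maps term to a list of
    (doc_id, doc_weight) postings. Returns (top doc_ids, terms used).
    """
    terms = sorted((t for t in query if index.get(t)), key=lambda t: -query[t])
    bounds = [query[t] * max(w for _, w in index[t]) for t in terms]
    acc = defaultdict(float)
    used = 0
    for i, term in enumerate(terms):
        for doc_id, w in index[term]:
            acc[doc_id] += query[term] * w
        used = i + 1
        remaining = sum(bounds[i + 1:])
        if guaranteed_count(acc, remaining, k) >= n_good:
            break
    ranked = sorted(acc.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return [doc_id for doc_id, _ in ranked], used


INDEX = {
    "rare": [("d1", 0.9), ("d2", 0.8)],
    "mid": [("d1", 0.5), ("d3", 0.4)],
    "common": [("d1", 0.1), ("d2", 0.1), ("d3", 0.1), ("d4", 0.1), ("d5", 0.1)],
}
QUERY = {"rare": 3.0, "mid": 2.0, "common": 0.2}

print(optimized_retrieval(QUERY, INDEX, k=2, n_good=2))
```

In this toy example the two top documents are already guaranteed after the highest-weighted term, so the longest list ("common") is never read. The scores behind an early-terminated ranking are only partial, so the order inside the returned list may differ from the order a full evaluation would give.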