SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
In our runs, the global phrase run RUN 11 does about [illegible]% better in retrieval effectiveness
than RUN 1, at a cost of increased indexing time, indexing space, and retrieval time. Similarly, the
local/global phrase run RUN 12 is about [illegible]% better than the local/global single term run with the
same parameters, RUN 9.
Alternative Weighting Schemes
The two passes needed for idf weights in documents are a definite burden. In earlier experiments
with other collections we found that not using idf in documents (while still using it in queries) was
very reasonable, just a bit less effective than using idf.
When tried on TREC 1, tf-cosine normalized (nnc) document weights even proved to be marginally
better than the ntc document weights. The one run with nnc weights, RUN 13, took less than half the
total indexing time of RUN 1. There is no question that this is a major advantage of nnc. The
possible advantage that ntc weights might have is in feedback, where normally query weights and
document weights are combined to form new query weights.
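The difference in indexing cost between the two document weightings can be illustrated with a minimal sketch. It is not the SMART implementation: the function names, the toy collection, and the idf formula log(N / df) are assumptions made for illustration. The point is only that nnc weights can be computed from each document alone, while ntc weights need document frequencies gathered in an earlier pass over the whole collection.

```python
import math
from collections import Counter

def nnc_weights(tokens):
    """Raw tf with cosine normalization; needs only the document itself."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {t: f / norm for t, f in tf.items()} if norm > 0 else {}

def ntc_weights(tokens, doc_freq, num_docs):
    """tf * idf with cosine normalization; needs collection-wide statistics.

    The idf form log(N / df) is an assumption, not taken from the paper.
    """
    tf = Counter(tokens)
    raw = {t: f * math.log(num_docs / doc_freq[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw

# Hypothetical three-document collection.
docs = [["salt", "treaty", "treaty"], ["salt", "mine"], ["treaty", "talks"]]

# One pass is enough for nnc.
nnc = [nnc_weights(d) for d in docs]

# ntc needs a first pass to count document frequencies, then a second to weight.
doc_freq = Counter(t for d in docs for t in set(d))
ntc = [ntc_weights(d, doc_freq, len(docs)) for d in docs]
```

In the second document, nnc gives "salt" and "mine" equal weight, while ntc favors the rarer term "mine".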
A variant of term frequency weighting that has been in SMART for a couple of years is the l
scheme (e.g., ltc instead of ntc). l stands for log; (1.0 + ln(tf)) is used instead of tf, the number of
times a term occurs in a document. The goal is to downweight the importance of the tf factor in
collections which have very long documents. That fits TREC very well.
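A small sketch of the l weighting shows how quickly the logarithm flattens large term counts. The function name `l_tf` and the sample counts are illustrative choices, not names from SMART.

```python
import math

def l_tf(tf):
    """Log-damped term frequency: 1.0 + ln(tf) for tf >= 1, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# A term occurring 1000 times counts roughly 8 times as much as a term
# occurring once, rather than 1000 times as much.
for tf in (1, 10, 100, 1000):
    print(tf, round(l_tf(tf), 2))
```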
RUN 14 and RUN 15 describe using lnc document weights and ltc query weights for single terms
and phrases respectively. They work remarkably well, about 20% better than the corresponding
nnc.ntc runs. That is an enormous improvement.
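As a rough illustration of the lnc.ltc combination, the sketch below weights documents with log tf and cosine normalization, weights the query with log tf times idf and cosine normalization, and scores by inner product. The helper names, the toy collection, and the idf form log(N / df) are assumptions for illustration only; the actual SMART code may differ in detail.

```python
import math
from collections import Counter

def log_tf(f):
    return 1.0 + math.log(f)

def cosine_normalize(weights):
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {t: v / norm for t, v in weights.items()} if norm > 0 else dict(weights)

def lnc(tokens):
    """Document weights: log tf, no idf, cosine normalization."""
    return cosine_normalize({t: log_tf(f) for t, f in Counter(tokens).items()})

def ltc(tokens, doc_freq, num_docs):
    """Query weights: log tf times idf (assumed log(N / df)), cosine normalization."""
    return cosine_normalize({
        t: log_tf(f) * math.log(num_docs / doc_freq[t])
        for t, f in Counter(tokens).items()
        if doc_freq.get(t, 0) > 0  # drop query terms absent from the collection
    })

def similarity(query_vec, doc_vec):
    """Inner product over the terms shared by query and document."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Hypothetical collection and query.
docs = {
    "d1": ["salt", "treaty", "treaty", "arms"],
    "d2": ["salt", "mine", "mine", "industry"],
    "d3": ["arms", "talks"],
}
doc_freq = Counter(t for d in docs.values() for t in set(d))
query = ltc(["salt", "treaty"], doc_freq, len(docs))
scores = {name: similarity(query, lnc(tokens)) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because idf appears only on the query side, the document vectors can still be built in a single pass over the collection.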
Actually, half of the improvement is somewhat questionable. About 10% out of the total 20% is
due to the ltc query weights, and the other half is due to the lnc document weights. The document
weight improvement is reasonable: there are [count illegible] documents of all sizes indexed with lnc weights.
I feel the strong improvement due to ltc query weights is almost certainly an artifact of the TREC
queries, and possibly even an artifact of the second query set (queries 51 - 100). 50 queries is a small
enough number that random effects can be important.
An average user-supplied query will not have the distribution of terms that the TREC queries
have. For that reason, ltc query weights cannot be generally recommended. In tests on small
collections, ltc performs about the same as ntc. It should not hurt to use ltc, but don't bet the farm
on it for TREC 2!
Failure Analysis
There seems to be little consistent that can be said about the performance of Smart in the ad-hoc
experiments. Smart does comparatively better (when compared with the median values) on queries
with a lot of relevant documents, as opposed to those with few relevant documents, where it often
does substantially worse. But it is hard to tell whether that is a feature of the system or the queries.
For some queries, the Smart performance is very poor, because the query structure is ignored. That
is especially true of queries using NOT clauses. The NOT is ignored and the following words are
treated as positive relevance indicators.
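The effect can be pictured with a toy sketch. It assumes, purely for illustration, that the query is reduced to a bag of words and that operator words such as NOT are dropped like stopwords; the function name and the stoplist are hypothetical and not taken from SMART.

```python
def flatten_query(text, ignored=frozenset({"and", "or", "not"})):
    """Reduce a query to a bag of words, discarding operator words.

    Terms that followed NOT survive as ordinary (positive) query terms.
    """
    return [w for w in text.lower().split() if w not in ignored]

# The excluded topic ends up among the positive terms.
terms = flatten_query("treaty NOT salt mines")
print(terms)  # ['treaty', 'salt', 'mines']
```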
In general, the local match requirement does not have as big an effect on queries as it has
on other collections. There are definite successes; for example, query 69 on "Attempts to Revive
the SALT II Treaty." The local requirement rejects all documents that deal with industrial salts
instead of a peace treaty. But in TREC 1 at least, there are few queries in which ambiguous words
played an important part.