SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
Once term vectors are available for all information items, all subsequent processing is based on
term vector manipulations.
The fact that the indexing of both documents and queries is completely automatic means that
the results obtained are reasonably collection independent and should be valid across a wide range
of collections. No human expertise in the subject matter is required for either the initial collection
creation or the actual query formulation.
Text Similarity Computation
When the texts are represented by term vectors of the form $D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$ and
$D_j = (w_{j1}, w_{j2}, \ldots, w_{jt})$ for documents $D_i$ and $D_j$, a similarity ($S$) computation between two items can
conveniently be obtained as the inner product between corresponding weighted term vectors as
follows:
$$S(D_i, D_j) = \sum_{k=1}^{t} (w_{ik} \cdot w_{jk}) \tag{2}$$
Thus, the similarity between two texts (whether query or document) depends on the weights of
coinciding terms in the two vectors.
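
As a rough illustration of expression (2), the following minimal sketch computes the inner product of two sparse term vectors. The dictionary representation, the function name `inner_product`, and the example weights are assumptions made for this example; they are not taken from the SMART implementation.

```python
def inner_product(d_i, d_j):
    """Sum of w_ik * w_jk over the terms k that occur in both vectors.

    Each vector is a dict mapping a term to its weight (an assumed,
    illustrative representation). Terms absent from a vector have weight 0,
    so only coinciding terms contribute to the sum.
    """
    # Iterate over the shorter vector to keep the loop small.
    if len(d_i) > len(d_j):
        d_i, d_j = d_j, d_i
    return sum(w * d_j[term] for term, w in d_i.items() if term in d_j)


# Hypothetical weights chosen only for illustration.
doc = {"sound": 0.5, "wave": 0.5, "audio": 0.25}
query = {"sound": 0.5, "water": 0.75}

# Only "sound" occurs in both vectors, so the result is 0.5 * 0.5.
print(inner_product(doc, query))
```

Because non-coinciding terms contribute nothing, the cost of the computation depends only on the number of terms actually present in the shorter vector.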
Information retrieval and text linking systems based on the use of global text similarity measures
such as that of expression (2) will be successful when the common terms in the two vectors are in
fact used in semantically similar ways. In many cases it may however happen that highly-weighted
terms that contribute substantially to the text similarity are semantically distinct. For example, a
sound may be an audible phenomenon or a body of water.
In determining the meaning of individual words we take advice from Wittgenstein and others
who suggest that text understanding must be based on a study of how text words are used in the
language ("word use" theory of text meaning).[11] In a mechanized text manipulation environment,
"word use" may be interpreted as the contexts in which the words are used in the texts in which
they occur. The assumption then is that identical words used in identical contexts (that is, in
substantially similar phrases, sentences and paragraphs) are in fact semantically homogeneous.
Contrariwise, similar words such as "sound" are expected to occur in different local environments
when they represent different entities such as bodies of water and audible phenomena.
To detect similarity of local term environments, we carry out text similarity measurements
such as those suggested by expression (2), but applied to small text units such as text sentences
and text paragraphs. Two texts are then accepted as related only when a sufficiently high global
text similarity exists, as well as sufficient local text similarities in the form of similarities between
sentences and/or paragraphs in the texts under study.[T.([OCRerr]]
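
The two-level test described above can be sketched as follows. This is only an illustration under stated assumptions: term weights are raw term counts, the local units are sentences split on punctuation, the tiny stop-word set and both thresholds are arbitrary, and all function names are hypothetical rather than part of SMART.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "was"}  # assumption: a tiny stop list for this example only


def term_vector(text):
    """Raw term-count vector (an assumed stand-in for real term weights)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def similarity(u, v):
    """Inner product over coinciding terms, in the manner of expression (2)."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def split_sentences(text):
    """Naive sentence splitter; real text needs something more careful."""
    return [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]


def best_local_similarity(text_a, text_b):
    """Highest sentence-to-sentence similarity between the two texts."""
    return max(
        (similarity(term_vector(a), term_vector(b))
         for a in split_sentences(text_a)
         for b in split_sentences(text_b)),
        default=0,
    )


def related(text_a, text_b, global_min, local_min):
    """Accept a pair only if both global and local similarity are high enough."""
    if similarity(term_vector(text_a), term_vector(text_b)) < global_min:
        return False
    return best_local_similarity(text_a, text_b) >= local_min


a = "The sound was loud. The ferry was late."
b = "The ferry crossed the sound. The crowd was loud."

print(similarity(term_vector(a), term_vector(b)))  # global similarity
print(best_local_similarity(a, b))                 # best sentence pair
print(related(a, b, global_min=3, local_min=2))
```

In this example the two texts share three terms overall, but no single pair of sentences shares more than one, so the pair passes the global test and is rejected by the local one.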
A complete text retrieval system based on text similarity computations is then generated in the
following way:
1. formation of term vectors for the text items
2. computation of text similarities, and elimination of text pairs with insufficient global text
similarities
3. computation of local text similarities for the remaining texts
4. retrieval of text items with sufficiently large global and local similarities
5. use of user relevance judgments for rephrasing of search requests using relevance feedback.
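
Steps 1 through 4 above can be sketched as a small pipeline over a toy collection. The sketch assumes raw term counts as weights, sentences as the local text units, and arbitrary thresholds; the names `index` and `retrieve` and the sample documents are hypothetical. Step 5 (relevance feedback) is not shown, since the method is not specified in this passage.

```python
import re
from collections import Counter


def vec(text):
    """Raw term-count vector (assumed weighting, for illustration only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def sim(u, v):
    """Inner product over coinciding terms, in the manner of expression (2)."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def index(text):
    """Step 1: one global vector plus one vector per sentence."""
    parts = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return vec(text), [vec(s) for s in parts]


def retrieve(query, docs, global_min, local_min):
    q_global, q_local = index(query)
    indexed = {name: index(text) for name, text in docs.items()}

    # Step 2: global similarities; drop items below the global threshold.
    candidates = {}
    for name, (d_global, d_local) in indexed.items():
        g = sim(q_global, d_global)
        if g >= global_min:
            candidates[name] = (g, d_local)

    # Steps 3 and 4: local similarities for the survivors, then keep
    # only the items that also meet the local threshold.
    results = []
    for name, (g, d_local) in candidates.items():
        best = max((sim(q, d) for q in q_local for d in d_local), default=0)
        if best >= local_min:
            results.append((name, g, best))

    # Rank by global similarity, then local similarity, then name.
    return sorted(results, key=lambda r: (-r[1], -r[2], r[0]))


docs = {
    "acoustics": "Loud sound waves travel fast. Sound waves carry energy.",
    "geography": "Ferries cross the sound daily. Waves hit the ferries.",
    "cooking": "Simmer the broth slowly.",
}

print(retrieve("sound waves", docs, global_min=2, local_min=2))
```

Here the "geography" text passes the global threshold because both query terms occur somewhere in it, but it is eliminated in the local step because they never occur in the same sentence.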