SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
Official Runs (Ad-hoc Queries)
The overall procedure used for both of the official runs (one with single terms only, the other
including phrases) is given in the figure below.
1. Automatically index document collection D1+D2 using
   tf * idf weights, cosine normalized (ntc), creating inverted file.
2. Automatically index query collection Q2 using
   tf * idf weights, cosine normalized (ntc).
3. For each Q in Q2
   3.1 Compute sim of Q to each of the documents in D1+D2,
       keeping track of the top 500 documents.
   3.2 Re-index Q, breaking it down into sentences.
       Each sentence is reweighted with tf * idf weights (ntn)
       and formed into a vector.
   3.3 For each D in the top 500 global sim documents
       3.3.1 Re-index D, breaking it down into sentences.
             Each sentence is reweighted with tf * idf weights (ntn)
             and formed into a vector.
       3.3.2 Do a pairwise comparison of every sentence
             vector of Q against every sentence vector
             of D. If some sentence match satisfies the
             local criteria, add a large constant to the
             global sim of Q vs D already computed.
   3.4 Return the top 200 documents out of the set of 500
       documents. Given the method above, first will be
       the documents satisfying the local criteria, sorted
       by global sim, and then the documents not
       satisfying the local criteria, sorted by global sim.
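
As an illustration of steps 3.3.2 and 3.4 only, the re-ranking can be sketched in Python as follows. This is a minimal sketch and not the SMART code: the names `rerank`, `satisfies_local` and `LOCAL_BONUS` are hypothetical, the global similarities are assumed to be already computed, and the value of the "large constant" is an assumption (any value above the largest possible global similarity would give the same ordering).

```python
from typing import Callable, Dict, List

# Assumed value of the "large constant"; cosine-normalized global sims are
# at most 1.0, so any bonus above that keeps the two groups separated.
LOCAL_BONUS = 1000.0


def rerank(
    global_sims: Dict[str, float],
    satisfies_local: Callable[[str], bool],
    pool_size: int = 500,
    return_size: int = 200,
) -> List[str]:
    """Return document ids ordered as in step 3.4 of the procedure."""
    # Step 3.1: keep only the top `pool_size` documents by global similarity.
    pool = sorted(global_sims.items(), key=lambda kv: kv[1], reverse=True)
    pool = pool[:pool_size]

    # Step 3.3.2: add the constant to documents meeting the local criteria.
    boosted = [
        (doc_id, sim + LOCAL_BONUS if satisfies_local(doc_id) else sim)
        for doc_id, sim in pool
    ]

    # Step 3.4: locally matching documents come first, each group sorted
    # by global similarity; return the top `return_size`.
    boosted.sort(key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in boosted[:return_size]]
```

With a pool of three documents where only the third-ranked one satisfies the local criteria, that document moves to the front and the other two keep their relative order.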
The local criteria varied in the two official runs.
Single Term Global/Local Matching
Indexing the document collection for the single term run (step 1) took [illegible] CPU hours, creating an
inverted file of 690 Mbytes.
The actual retrieval took [illegible] CPU seconds for all 50 queries and considerably longer in elapsed
time (a large amount of time being spent waiting for disk I/O). For each query the text of 500
documents had to be read in, broken down into sentences, indexed, and weighted with idf weights.
Then every indexed sentence in the query had to be compared against every indexed sentence in
the document.
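
The per-document work described above (sentence splitting, ntn weighting, and the all-pairs sentence comparison) might look roughly like the Python sketch below. It is an illustration under stated assumptions: the sentence splitter and tokenizer are naive stand-ins for SMART's indexing, the `idf` table is assumed to be supplied by the caller, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def ntn_vector(sentence: str, idf: Dict[str, float]) -> Dict[str, float]:
    # ntn weighting: raw term frequency times idf, with no normalization.
    tf = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def inner_product(u: Dict[str, float], v: Dict[str, float]) -> float:
    # Iterate over the shorter vector; only shared terms contribute.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[term] for term, w in u.items() if term in v)


def best_sentence_match(query_text: str, doc_text: str,
                        idf: Dict[str, float]) -> float:
    """Highest similarity over all (query sentence, document sentence) pairs."""
    q_vecs = [ntn_vector(s, idf) for s in split_sentences(query_text)]
    d_vecs = [ntn_vector(s, idf) for s in split_sentences(doc_text)]
    return max((inner_product(q, d) for q in q_vecs for d in d_vecs),
               default=0.0)
```

The cost is one inner product per sentence pair, so a query with m sentences against a document with n sentences needs m * n comparisons, repeated for each of the 500 documents.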
The local matching criteria are based on the detection of matching substructures in both the
query texts and the texts of retrieved documents. The criterion used for the sample runs required
the presence of at least one pair of matching text sentences with a pairwise sentence similarity
of at least 100.0. This was chosen to be high enough so that very few matches of only one term
would satisfy the threshold but low enough so that several medium weighted terms could match
and reach the threshold.
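
To make the effect of the threshold concrete, the following worked example assumes that the sentence similarity is the inner product of the two ntn vectors (raw term frequency times idf, no normalization). The idf values are illustrative and not taken from the collection.

```latex
\[
  \mathrm{sim}(s_q, s_d)
    = \sum_{t \in s_q \cap s_d}
      (\mathit{tf}_{q,t}\,\mathit{idf}_t)\,(\mathit{tf}_{d,t}\,\mathit{idf}_t)
    = \sum_{t \in s_q \cap s_d}
      \mathit{tf}_{q,t}\,\mathit{tf}_{d,t}\,\mathit{idf}_t^{\,2}
\]
% One shared term occurring once in each sentence:
\[
  \mathit{idf}_t^{\,2} \ge 100.0 \iff \mathit{idf}_t \ge 10
\]
% Three shared terms, each occurring once, each with an assumed idf of 6:
\[
  3 \times 6^2 = 108 \ge 100.0
\]
```

Under these assumptions a single matching term reaches the threshold only if its idf is at least 10, whereas three matching terms of moderate weight are enough.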