SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan
National Institute of Standards and Technology
Donna K. Harman

Official Runs (Ad-hoc Queries)

The overall procedure used for both of the official runs (one with single terms only, the other including phrases) is given in Figure 1.

1. Automatically index document collection D1+D2 using tf * idf weights, cosine normalized (ntc), creating an inverted file.
2. Automatically index query collection Q2 using tf * idf weights, cosine normalized (ntc).
3. For each Q in Q2:
   3.1 Compute the sim of Q to each of the documents in D1+D2, keeping track of the top 500 documents.
   3.2 Re-index Q, breaking it down into sentences. Each sentence is reweighted with tf * idf weights (ntn) and formed into a vector.
   3.3 For each D in the top 500 global sim documents:
       3.3.1 Re-index D, breaking it down into sentences. Each sentence is reweighted with tf * idf weights (ntn) and formed into a vector.
       3.3.2 Do a pairwise comparison of every sentence vector of Q against every sentence vector of D. If some sentence match satisfies the local criteria, add a large constant to the global sim of Q vs D already computed.
   3.4 Return the top 200 documents out of the set of 500 documents. Given the method above, first will be the documents satisfying the local criteria, sorted by global sim, and then the documents not satisfying the local criteria, sorted by global sim.

The local criteria varied in the two official runs.

Single Term Global/Local Matching

Indexing the document collection for the single term run (step 1) took [illegible] CPU hours, creating an inverted file of 690 Mbytes. The actual retrieval took [illegible] CPU seconds for all 50 queries, and considerably longer in elapsed time (a large amount of time being spent waiting for disk I/O). For each query the text of 500 documents had to be read in, broken down into sentences, indexed, and weighted with idf weights. Then every indexed sentence in the query had to be compared against every indexed sentence in the document.

The local matching criteria are based on the detection of matching substructures in both the query texts and the texts of retrieved documents. The criterion used for the sample runs required the presence of at least one pair of matching text sentences with a pairwise sentence similarity of at least 100.0. This was chosen to be high enough so that very few matches of only one term would satisfy the threshold, but low enough so that several medium weighted terms could match and reach the threshold.
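
The global/local procedure described above (steps 3.1 to 3.4) can be illustrated with a minimal Python sketch. It is not the SMART implementation. The tokenizer, the sentence splitter, the idf formula log(N/df), the value of LOCAL_BONUS, and all function names are assumptions made for illustration. The text says only that "a large constant" is added, and it does not state the scale of SMART's weights, so the 100.0 threshold is kept as a default and the toy example passes a much smaller one.

```python
import math
import re
from collections import Counter

LOCAL_BONUS = 1000.0  # assumed value; the text says only "a large constant"


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs, no stemming or stop list.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def idf_table(docs):
    # Assumed idf = log(N / df); SMART's exact scaling is not given in the text.
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log(n / c) for t, c in df.items()}


def ntn(text, idf):
    # tf * idf weights, no normalization (used for sentence vectors).
    return {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(text)).items()}


def ntc(text, idf):
    # tf * idf weights, cosine normalized (used for the global similarity).
    vec = ntn(text, idf)
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def dot(u, v):
    # Inner product of two sparse vectors stored as dicts.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def retrieve(query, docs, top_global=500, top_final=200, threshold=100.0):
    idf = idf_table(docs)
    q_vec = ntc(query, idf)
    # Step 3.1: global similarity against every document, keep the best candidates.
    scored = sorted(
        ((dot(q_vec, ntc(d, idf)), i) for i, d in enumerate(docs)), reverse=True
    )[:top_global]
    # Step 3.2: sentence vectors for the query.
    q_sents = [ntn(s, idf) for s in split_sentences(query)]
    results = []
    for sim, i in scored:
        # Step 3.3.1: sentence vectors for the candidate document.
        d_sents = [ntn(s, idf) for s in split_sentences(docs[i])]
        # Step 3.3.2: any sentence pair reaching the threshold earns the bonus.
        local = any(dot(qs, ds) >= threshold for qs in q_sents for ds in d_sents)
        results.append((sim + LOCAL_BONUS if local else sim, i))
    # Step 3.4: locally matching documents first, each group sorted by global sim.
    results.sort(reverse=True)
    return results[:top_final]


if __name__ == "__main__":
    docs = [
        "Oil fell. Prices rose.",
        "Oil prices rose sharply. Markets reacted with alarm today.",
        "Weather was mild.",
        "Football results came in.",
    ]
    for score, idx in retrieve("Oil prices rose.", docs, threshold=1.2):
        print(idx, round(score, 3))
```

In this toy collection the first document has the higher global similarity, but its query terms are spread over two sentences. The second document contains all three terms in one sentence, so with the lowered threshold it satisfies the local criterion and is ranked first. Adding a constant larger than any cosine similarity is what produces the two-tier ordering described in step 3.4.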