SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Automatic Retrieval With Locality Information Using SMART chapter C. Buckley G. Salton J. Allan National Institute of Standards and Technology Donna K. Harman

Phrase Global/Local Matching

The phrases being used were two-term SMART adjacency phrases. Phrases were adjacent non-stopwords, with term components stemmed, that occurred at least 25 times in the learning (D1) document set. The term components were put into alphabetical order; thus the text phrases "information retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were treated as a separate concept type (ctype) within an indexed vector, and had their own dictionary and inverted file separate from those of the single terms. The components of phrases remained in the single term ctype.

Determination of phrases took 5.[?] hours, finding 4,700,000 phrases occurring in D1 at least once. Of those phrases, 15[?],000 occurred at least 25 times. These phrases were then put into a dictionary and used as a controlled vocabulary for phrases when doing the indexing of D1+D2. The single term indexing remained exactly as it was in the single term run (terms occurring in phrases were not removed from the vector). Both the terms and phrases were given a "natural" tf * idf weight (i.e., the idf weight was based on the collection frequency of the phrase itself rather than being a function of the idf values of the single term components). The cosine normalization was handled in the following way.
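Before turning to the normalization details, the phrase construction described above can be illustrated with a minimal sketch. It is not SMART's implementation: the stopword list, the `toy_stem` suffix stripper, and the function names are hypothetical stand-ins, and two points the text leaves open are treated as assumptions (a stopword between two words breaks adjacency, and "occurred at least 25 times" counts total occurrences rather than documents).

```python
import re
from collections import Counter

# Hypothetical toy stopword list; SMART's actual list is much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "to"}


def toy_stem(word):
    """Crude suffix stripper standing in for SMART's stemmer (assumption)."""
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def adjacency_phrases(text):
    """Return two-term phrases built from adjacent non-stopwords.

    Components are stemmed and put into alphabetical order, so word
    order in the text does not matter.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    phrases = []
    for left, right in zip(tokens, tokens[1:]):
        if left in STOPWORDS or right in STOPWORDS:
            continue  # assumption: a stopword breaks adjacency
        phrases.append(tuple(sorted((toy_stem(left), toy_stem(right)))))
    return phrases


def phrase_vocabulary(documents, min_count=25):
    """Keep phrases seen at least min_count times in the learning set.

    Counting total occurrences (not documents) is an assumption.
    """
    counts = Counter(p for doc in documents for p in adjacency_phrases(doc))
    return {p for p, n in counts.items() if n >= min_count}
```

Under these assumptions, `adjacency_phrases("information retrieval")` and `adjacency_phrases("retrieving information")` both yield the single phrase `("information", "retriev")`, mirroring the example in the text. The set returned by `phrase_vocabulary` plays the role of the controlled phrase vocabulary; how the resulting vectors were normalized is described next.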
All terms in the vectors had their weight divided by the cosine length of the single term sub-vector instead of the vector as a whole. Thus the weights of single terms in the final vector were exactly the same as if phrases were not being used. At retrieval time, the effect of a phrase match was divided by 2. (The same effect could have been obtained by dividing the indexed weight of a phrase by sqrt(2), since a match multiplies a query weight by a document weight.)

Indexing the document collection with phrases took 10.6 hours, creating an inverted file of 840 Mbytes. The actual retrieval took 2105 CPU seconds for all 50 queries, and again considerably longer in elapsed time.

A more complicated local match criterion was used for the phrase run. The basic threshold was reduced from 100.0 (used in the single term run) to 75.0, but an additional restriction was placed on the match to ensure that no one term would contribute more than 65 percent of the computed pairwise sentence similarity for any sentence pair. This effectively eliminates sentence matches due to the presence of a single highly weighted term. The 65% was determined empirically from tests using the learning query/document sets; percentages ranging from 55% to 75% performed equally well.

The more complicated local match, in conjunction with the phrases, produced a very significant improvement in the phrase run as opposed to the single-term run. The 11-point average over 50 queries for the single-term run was 0.1738, while the phrase run did very well at 0.2032.

Official Routing Queries

Standard SMART relevance feedback techniques were used to automatically construct routing queries to be run on the test set of documents (D2). Each routing query was composed of terms from the original indexed query plus the "best" 30 terms from the documents in the learning set that were relevant to that query.
The weight of each routing query term was a linear combination of the tf x idf weight in the original query, the tf x idf weight in each of the relevant documents, and the tf x idf weight in a single non-relevant document. E. Ide's [4,10] feedback formula was used:

$$Q_{new} = Q_{old} + \sum_{i} r_i - s$$

where $Q_{old}$ is the original query vector, the $r_i$ are the vectors of the relevant documents, and $s$ is the vector of the single non-relevant document. The idf component in the query and document weights was based on occurrences of the term in the learning set of documents only.
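The routing query construction can be sketched as follows, assuming vectors are represented as dictionaries mapping a term to its tf x idf weight. The function name `ide_feedback` and its parameters are hypothetical, and the rule used here to pick the "best" expansion terms (highest combined weight among terms from relevant documents) is an assumption; the text does not say how "best" was defined.

```python
from collections import defaultdict


def ide_feedback(query, relevant_docs, nonrelevant_doc, max_new_terms=30):
    """Ide-style feedback: old query + sum of relevant docs - one
    non-relevant doc. Each vector is a dict of term -> tf x idf weight."""
    combined = defaultdict(float)
    for term, weight in query.items():
        combined[term] += weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            combined[term] += weight
    for term, weight in nonrelevant_doc.items():
        combined[term] -= weight

    # Expansion candidates come only from relevant documents.
    relevant_terms = {t for doc in relevant_docs for t in doc}
    candidates = sorted(
        (t for t in relevant_terms if t not in query),
        key=lambda t: combined[t],  # assumption: "best" = highest weight
        reverse=True,
    )[:max_new_terms]

    # Original query terms are always kept; at most max_new_terms are added.
    kept = set(query) | set(candidates)
    return {t: combined[t] for t in kept}


if __name__ == "__main__":
    query = {"retriev": 1.0}
    relevant = [{"retriev": 0.5, "index": 0.4}, {"index": 0.2, "vector": 0.1}]
    nonrelevant = {"index": 0.1, "noise": 0.9}
    print(ide_feedback(query, relevant, nonrelevant, max_new_terms=1))
```

In the small example, the original term "retriev" is kept with its weight increased by the relevant document, "index" is added as the single expansion term, and "noise", which occurs only in the non-relevant document, never enters the routing query.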