SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
Phrase Global/Local Matching
The phrases being used were two-term SMART adjacency phrases. Phrases were adjacent non-stopwords,
with term components stemmed, that occurred at least 25 times in the learning (D1) document
set. The term components were put into alphabetical order; thus the text phrases "information
retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were
treated as a separate concept type (ctype) within an indexed vector, and had their own dictionary
and inverted file separate from those of the single terms. The components of phrases remained in
the single term ctype.
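
The phrase construction just described can be sketched roughly as follows. This is only an illustration, not SMART's code: the stop list, the `toy_stem` suffix stripper, and the function names are hypothetical stand-ins, and the sketch assumes that a stopword breaks adjacency and that the threshold counts total occurrences rather than documents.

```python
from collections import Counter

# Hypothetical stop list; SMART's actual list is not reproduced here.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "for", "is"}


def toy_stem(word):
    """Crude suffix stripper, present only to make the sketch runnable."""
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def phrase_keys(text):
    """Return one order-independent key per pair of adjacent non-stopwords."""
    tokens = text.lower().split()
    keys = []
    for left, right in zip(tokens, tokens[1:]):
        if left in STOPWORDS or right in STOPWORDS:
            continue  # assumption: a stopword breaks adjacency
        # Alphabetical ordering makes both word orders map to one concept.
        keys.append(tuple(sorted((toy_stem(left), toy_stem(right)))))
    return keys


def build_phrase_dictionary(documents, min_count=25):
    """Keep the phrases that occur at least min_count times in the learning set."""
    counts = Counter()
    for doc in documents:
        counts.update(phrase_keys(doc))
    return {key for key, n in counts.items() if n >= min_count}
```

With these stand-ins, `phrase_keys("information retrieval")` and `phrase_keys("retrieving information")` both yield the key `("information", "retriev")`.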
Determination of phrases took 5.5 hours, finding 4,700,000 phrases occurring in D1 at least
once. Of those phrases, roughly 150,000 occurred at least 25 times. These phrases were then put into a
dictionary and used as controlled vocabulary for phrases when doing the indexing of D1+D2. The
single term indexing remained exactly as it was in the single term run (terms occurring in phrases
were not removed from the vector).
Both the terms and phrases were given a "natural" tf * idf weight (i.e., the idf weight was based
on the collection frequency of the phrase itself rather than being a function of the idf values of the
single term components). The cosine normalization was handled in the following way. All terms in
the vectors had their weight divided by the cosine length of the single term sub-vector instead of
the vector as a whole. Thus the weights of single terms in the final vector were exactly the same
as if phrases were not being used. At retrieval time, the effect of a phrase match was divided by 2.
(The same effect could have been obtained by dividing the indexed weight of a phrase by sqrt(2).)
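
A minimal sketch of this normalization, assuming vectors are plain dictionaries of tf * idf weights and that similarity is an inner product over shared concepts (the function names and the `phrase_factor` parameter are hypothetical, not SMART's):

```python
import math


def normalize(single_terms, phrases):
    """Divide every weight by the cosine length of the single-term sub-vector."""
    length = math.sqrt(sum(w * w for w in single_terms.values()))
    if length == 0:
        raise ValueError("single-term sub-vector is empty or all zero")
    terms = {t: w / length for t, w in single_terms.items()}
    phr = {p: w / length for p, w in phrases.items()}
    return terms, phr


def similarity(q_terms, q_phr, d_terms, d_phr, phrase_factor=0.5):
    """Inner product, with the phrase contribution scaled by phrase_factor."""
    term_score = sum(w * d_terms[t] for t, w in q_terms.items() if t in d_terms)
    phrase_score = sum(w * d_phr[p] for p, w in q_phr.items() if p in d_phr)
    return term_score + phrase_factor * phrase_score
```

Because a phrase match multiplies a query weight by a document weight, scaling both indexed phrase weights by 1/sqrt(2) and calling `similarity` with `phrase_factor=1.0` gives the same score as halving the match at retrieval time.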
Indexing the document collection with phrases took 10.6 hours, creating an inverted file of 840
Mbytes. The actual retrieval took 2105 CPU seconds for all 50 queries and again considerably
longer in elapsed time.
A more complicated local match criterion was used for the phrase run. The basic threshold was
reduced from 100.0 (used in the single term run) to 75.0, but an additional restriction was placed
on the match to ensure that no one term would contribute more than 65 percent of the computed
pairwise sentence similarity for any sentence pair. This effectively eliminates sentence matches due
to the presence of a single highly weighted term. The 65% was determined empirically from tests
using the learning query/document sets; percentages ranging from 55% to 75% performed equally
well.
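
The restriction can be illustrated with a short sketch. It assumes that pairwise sentence similarity is a sum of per-term products of non-negative weights and that a similarity exactly at the threshold passes; neither detail is stated above, and the function name is hypothetical.

```python
def sentence_pair_matches(sent_a, sent_b, threshold=75.0, max_share=0.65):
    """Accept a sentence pair only if it is similar enough AND no single
    term supplies more than max_share of the total similarity."""
    contributions = [w * sent_b[t] for t, w in sent_a.items() if t in sent_b]
    total = sum(contributions)
    if not contributions or total < threshold:
        return False
    return max(contributions) <= max_share * total
```

For example, two sentences sharing one term with weight 10.0 and one with weight 1.0 reach a similarity of 101.0, but the heavy term supplies about 99 percent of it, so the pair is rejected.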
The more complicated local match, in conjunction with the phrases, produced a very significant
improvement in the phrase run as opposed to the single-term run. The 11-point average over 50
queries for the single-term run was 0.1738 while the phrase run did very well at 0.2032.
Official Routing Queries
Standard SMART relevance feedback techniques were used to automatically construct routing
queries to be run on the test set of documents (D2).
Each routing query was composed of terms from the original indexed query plus the "best" 30
terms from the documents in the learning set that were relevant to that query. The weight of each
routing query term was a linear combination of the tf x idf weight in the original query, the tf x idf
weight in each of the relevant documents, and the tf x idf weight in a single non-relevant document.
E. Ide's [4,10] feedback formula was used:
$$Q_{new} = Q_{old} + \sum_{rel} D_i - D_{nonrel}$$
The idf component in the query and document weights was based on occurrences of the term in
the learning set of documents only.
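
As a rough sketch of this routing query construction, assuming the unweighted form of Ide's formula (original query, plus every relevant document, minus one non-relevant document) and assuming that the "best" new terms are those with the highest resulting weight (the selection criterion is not spelled out above, and the function and parameter names are hypothetical):

```python
def ide_routing_query(query, relevant_docs, nonrelevant_doc, n_new_terms=30):
    """Build a routing query from dictionaries of tf x idf weights."""
    weights = dict(query)
    # Add the weight of each term in every relevant document.
    for doc in relevant_docs:
        for term, w in doc.items():
            weights[term] = weights.get(term, 0.0) + w
    # Subtract the single non-relevant document; terms that appear only
    # there are ignored rather than added with a negative weight.
    for term, w in nonrelevant_doc.items():
        if term in weights:
            weights[term] -= w
    # Keep all original query terms plus the best-scoring new terms.
    new_terms = sorted(
        (t for t in weights if t not in query),
        key=lambda t: weights[t],
        reverse=True,
    )[:n_new_terms]
    keep = set(query) | set(new_terms)
    return {t: weights[t] for t in keep}
```

In this sketch the idf statistics are whatever the caller supplies; to follow the text, they would be computed from the learning set of documents only.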