NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, Donna K. Harman

Automatic Retrieval With Locality Information Using SMART

Chris Buckley, Gerard Salton and James Allan

Abstract

The Smart project at Cornell University, using a completely automatic approach for both routing and ad-hoc experiments, performed extremely well in the first Text Retrieval Conference. The basic ad-hoc approach uses local/global matching to achieve its results. A global match ensures that each retrieved document uses the same vocabulary as the query; a local match then attempts to guarantee that some local part of the document (e.g., a paragraph or sentence) focuses on the query topic. Runs were made with and without simple adjacency phrases. A simple relevance feedback algorithm is used for routing experiments; the original query is expanded by terms occurring in relevant documents, with term weights being based upon occurrence in the relevant documents. In addition, a set of system design issues and tradeoffs are examined.

Introduction

For over 30 years, the Smart project at Cornell has been interested in the analysis, search, and retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and the subject matter is unrestricted. Such databases may include newspaper articles, newswire dispatches, textbooks, dictionaries and encyclopedias, manuals, magazine articles, and so on.
The usual text analysis and text indexing approaches that are based on the use of thesauruses and other vocabulary control devices are difficult to apply in unrestricted text environments, because the word meanings are not stable in such circumstances and the interpretation varies depending on context. The applicability of more complex text analysis systems that are based on the construction of knowledge bases covering the detailed structure of particular subject areas, together with inference rules designed to derive relationships between the relevant concepts, is even more questionable in such cases. Complete theories of knowledge representation do not exist, and it is unclear what concepts, concept relationships and inference rules may be needed to understand particular texts.[5]

Accordingly, a text analysis and retrieval component must necessarily be based primarily on a study of the available texts themselves. Fortunately, very large text databases are now available in machine-readable form, and a substantial amount of information is automatically derivable about the occurrence properties of words and expressions in natural-language texts, and about the contexts in which the words are used. This information can help in determining whether two or more texts are semantically homogeneous, that is, whether they cover similar subject areas. When that is the case, such semantically homogeneous texts can be linked, thereby generating an automatic structured text (hypertext) representation; alternatively, in a retrieval setting a text can be retrieved when another semantically homogeneous text is submitted as a query.

*Department of Computer Science, Cornell University, Ithaca, NY 14853-7501. This study was supported in part by the National Science Foundation under grant IRI 89-15847.