SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
Automatic Retrieval With Locality Information Using
SMART
Chris Buckley,* Gerard Salton and James Allan
Abstract
The Smart project at Cornell University, using a completely automatic approach for both
routing and ad-hoc experiments, performed extremely well in the first Text Retrieval Conference.
The basic ad-hoc approach uses local/global matching to achieve its results. A global match
ensures that each retrieved document uses the same vocabulary as the query; a local match then
attempts to guarantee some local part of the document (e.g. a paragraph or sentence) focuses on
the query topic. Runs were made with and without simple adjacency phrases. A simple relevance
feedback algorithm is used for routing experiments; the original query is expanded by terms
occurring in relevant documents, with term weights being based upon occurrence in the relevant
documents. In addition, a set of system design issues and tradeoffs are examined.
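The local/global test described above can be illustrated with a minimal sketch. It is not the SMART implementation: the cosine measure over raw word counts, the blank-line paragraph split, the function names, and both thresholds are assumptions chosen only to show the two-stage idea.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased word counts; a simple stand-in for real indexing."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_global_match(query, document, global_min=0.2, local_min=0.4):
    """Accept a document only if it passes both the global and the local test.

    global_min and local_min are illustrative assumptions, not values
    taken from the paper.
    """
    q = term_vector(query)
    # Global match: the document as a whole shares vocabulary with the query.
    if cosine(q, term_vector(document)) < global_min:
        return False
    # Local match: at least one paragraph focuses on the query topic.
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    best_local = max((cosine(q, term_vector(p)) for p in paragraphs), default=0.0)
    return best_local >= local_min
```

Under these assumptions, a document whose query terms are scattered thinly across unrelated paragraphs can pass the global test and still be rejected by the local one.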
Introduction
For over 30 years, the Smart project at Cornell has been interested in the analysis, search, and
retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and the
subject matter is unrestricted. Such databases may include newspaper articles, newswire dispatches,
textbooks, dictionaries and encyclopedias, manuals, magazine articles, and so on. The usual text
analysis and text indexing approaches that are based on the use of thesauruses and other vocabulary
control devices are difficult to apply in unrestricted text environments, because the word meanings
are not stable in such circumstances and the interpretation varies depending on context. The
applicability of more complex text analysis systems that are based on the construction of knowledge
bases covering the detailed structure of particular subject areas, together with inference rules
designed to derive relationships between the relevant concepts, is even more questionable in such
cases. Complete theories of knowledge representation do not exist, and it is unclear what concepts,
concept relationships and inference rules may be needed to understand particular texts.[5]
Accordingly, a text analysis and retrieval component must necessarily be based primarily on a
study of the available texts themselves. Fortunately very large text databases are now available
in machine-readable form, and a substantial amount of information is automatically derivable
about the occurrence properties of words and expressions in natural-language texts, and about
the contexts in which the words are used. This information can help in determining whether two
or more texts are semantically homogeneous, that is, whether they cover similar subject areas.
When that is the case, such semantically homogeneous texts can be linked, thereby generating an
automatic structured text (hypertext) representation; alternatively, in a retrieval setting a text can
be retrieved when another semantically homogeneous text is submitted as a query.
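As a rough illustration of how word-occurrence information could be used to link semantically homogeneous texts, the sketch below links every pair of texts whose similarity reaches a threshold. The passage does not specify a measure; cosine similarity over raw word counts, the function names, and the threshold of 0.3 are all assumptions made for this example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Count lowercased words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_similar_texts(texts, threshold=0.3):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold is an illustrative assumption; each returned pair can be
    read as one link in a hypertext-style structure over the collection.
    """
    vectors = [term_counts(t) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if similarity(vectors[i], vectors[j]) >= threshold
    ]
```

The same similarity function covers the retrieval setting mentioned above: a query text is compared against each stored text, and those scoring above the threshold are retrieved.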
*Department of Computer Science, Cornell University, Ithaca, NY 14853-7501. This study was supported in part
by the National Science Foundation under grant IRI 89-15847.