NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan

This process is expected to perform effectively in producing both high precision as well as high recall.

System Description

The Cornell TREC experiments use the SMART Information Retrieval System, Version 11, and were run on a dedicated Sun Sparc 2 with 64 Mbytes of memory and 5 Gbytes of local disk. SMART Version 11 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. Version 11 is a reasonably complete re-write of earlier versions, and was designed and coded by C. Buckley. The new version is approximately 41,000 lines of C code and documentation.

SMART Version 11 offers a basic framework for investigations into the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, the weight indicating the importance of a concept to that particular document. The document representatives are physically stored on disk as an inverted file. Natural language queries go through the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity, and the documents are then fully ranked by similarity.

Specific Methodology Used for TREC Study

There are two major sets of Cornell TREC
experimental runs. The first set is the official TREC set, with ad-hoc runs using the local/global matching procedure described above in steps 1-4. There are two automatic runs in this set: one using only single-term indexing and the second using both single terms and two-term phrases. There is also an official routing run using a simple relevance feedback technique to form a revised query based on relevance judgements from the training set.

The other set of runs provides an examination of some of the tradeoffs (disk space, memory, time, and effectiveness) encountered within a single information retrieval system. There are many decisions that need to be made when designing a system; the goal in this set of runs is to explore the consequences of some fundamental choices, including:

1. Degree of stemming
2. Size of stopword list
3. Inverse document frequency weighting
4. Phrases
5. Query optimization

Both sets of runs use completely automatic indexing of queries and documents. Queries and documents are treated as flat text; some sections (like DOCID) might be omitted, but all indexable text is treated the same (unfortunately, even if preceded by a NOT!). This ignores the structure (in both form and semantic meaning) of the queries, which could be very useful. SMART has the capability to treat different parts of a query or document in different, appropriate manners. However, using this structure would have tremendously complicated the second set of runs by adding another large set of variables to the experiments. Now that the choices investigated in the second set of runs have been made, future runs can use the structure of documents and queries
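The retrieval model the system description outlines can be sketched compactly: documents are automatically indexed into weighted concept vectors stored as an inverted file, a natural-language query is indexed the same way, and documents are fully ranked by their similarity to the query. The following is a minimal sketch, not SMART's implementation; the particular weighting (tf times smoothed idf with cosine normalization), the tiny stoplist, and the toy corpus are illustrative assumptions.

```python
# Minimal vector-space retrieval sketch: weighted concept vectors,
# an inverted file, and full ranking by query-document similarity.
# Weighting scheme, stoplist, and corpus are illustrative assumptions.
import math
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "of", "to", "is", "in", "and", "by"}

def index(text):
    """Fully automatic indexing of flat text into raw term frequencies."""
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)

def build_inverted_file(docs):
    """Store document vectors as an inverted file: term -> [(doc, weight)]."""
    n = len(docs)
    tfs = {d: index(text) for d, text in docs.items()}
    df = Counter(t for v in tfs.values() for t in v)       # document frequency
    idf = {t: math.log(1 + n / df[t]) for t in df}         # smoothed idf
    inverted, norms = defaultdict(list), {}
    for d, v in tfs.items():
        w = {t: tf * idf[t] for t, tf in v.items()}        # concept weights
        norms[d] = math.sqrt(sum(x * x for x in w.values()))
        for t, x in w.items():
            inverted[t].append((d, x))
    return inverted, norms, idf

def rank(query, inverted, norms, idf):
    """Index the query the same way, then fully rank matching documents."""
    qw = {t: tf * idf[t] for t, tf in index(query).items() if t in idf}
    qnorm = math.sqrt(sum(x * x for x in qw.values())) or 1.0
    scores = defaultdict(float)
    for t, x in qw.items():
        for d, y in inverted[t]:
            scores[d] += x * y     # accumulate dot product via postings lists
    return sorted(((s / (norms[d] * qnorm), d) for d, s in scores.items()),
                  reverse=True)

docs = {
    "D1": "retrieval of documents using vector similarity",
    "D2": "stemming and stopword lists in indexing",
    "D3": "vector space retrieval ranks documents by similarity",
}
inverted, norms, idf = build_inverted_file(docs)
ranked = rank("vector similarity retrieval", inverted, norms, idf)
```

Because scoring walks only the postings lists of the query's terms, documents sharing no term with the query (D2 here) are never touched, which is the efficiency argument for the inverted-file organization.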