NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
C. Buckley, G. Salton, J. Allan
National Institute of Standards and Technology, Donna K. Harman

The routing query was then run against each of the documents in the test set. Those documents were indexed in the standard fashion, with each term receiving a tf*idf weight with cosine normalization. Again, the idf document weight was determined by the occurrences of the term in the learning set of documents only; thus no collection information from the test set of documents was used. It took 306 seconds to construct the full feedback query set (most of the time spent deciding which terms should be added to each query). It took 1.9 hours to index D2, forming an inverted file, and then 293 seconds to run the 50 reformulated queries against the inverted file. Effectiveness of this simple method was reasonable but not spectacular; the 11-point average over 50 queries was 0.1924.

Tradeoff runs

This set of runs provides an examination of some of the tradeoffs (disk space, memory, time, and effectiveness) encountered within a single information retrieval system. There are many decisions that need to be made when designing a system; the goal in this set of runs is to explore the consequences of some fundamental choices, including stopwords, stemming, phrases, and term weighting. Conceptually, the standard SMART indexing and retrieval algorithms are given below.

INDEXING
For each document/query text
1.1 Break the text into tokens.
1.2 Determine if each token is a common word (stopword) to be discarded.
1.3 Stem all remaining tokens to their root forms.
1.4 Assign concept numbers to each root, forming a "vector" of concepts.
1.5 Weight each term in the vector.
1.6 Store vector in an inverted file.

RETRIEVAL
For each term in the query
2.1 Get the inverted list of documents containing that term.
2.2 For each document/weight on the inverted list
2.3 Add Qi * Di to the partial similarity computed for this document so far (Qi is this term's query weight and Di the document weight).
2.4 Add the document to the current list of top documents if its similarity is high enough.
Return the list of top documents to the user.
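The indexing and retrieval steps above can be sketched in a few dozen lines. This is a minimal illustration, not the actual SMART implementation: the stoplist, the suffix-stripping stemmer, and the fixed top-k cutoff are simplified stand-ins for SMART's real components, and string stems serve in place of concept numbers.

```python
import math
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "in", "to", "and", "on"}  # stand-in stoplist

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index(docs):
    """Steps 1.1-1.6: tokenize, discard stopwords, stem, weight each
    term tf*idf with cosine normalization, store in an inverted file."""
    vectors = []
    df = defaultdict(int)                       # document frequency per term
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())                 # 1.1
        terms = [stem(t) for t in tokens if t not in STOPWORDS]      # 1.2-1.4
        tf = defaultdict(int)
        for t in terms:
            tf[t] += 1
        vectors.append(tf)
        for t in tf:
            df[t] += 1
    n = len(docs)
    inverted = defaultdict(list)                # term -> [(doc_id, weight)]
    for doc_id, tf in enumerate(vectors):
        weights = {t: f * math.log(n / df[t]) for t, f in tf.items()}  # 1.5
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for t, w in weights.items():
            inverted[t].append((doc_id, w / norm))                     # 1.6
    return inverted, df, n

def retrieve(query, inverted, df, n, top_k=10):
    """Steps 2.1-2.4: accumulate Qi * Di over the query's inverted lists."""
    tokens = [stem(t) for t in re.findall(r"[a-z]+", query.lower())
              if t not in STOPWORDS]
    sims = defaultdict(float)                   # partial similarity per doc
    for t in set(tokens):
        qi = tokens.count(t) * math.log(n / df[t]) if df.get(t) else 0.0
        for doc_id, di in inverted.get(t, []):  # 2.1-2.2
            sims[doc_id] += qi * di             # 2.3
    # 2.4: keep the top documents by similarity
    return sorted(sims.items(), key=lambda x: -x[1])[:top_k]
```

Because the query's idf weights here come only from the indexed collection's document frequencies, the sketch mirrors the property noted above: no collection statistics are drawn from anywhere but the learning set used to build the inverted file.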