NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Automatic Retrieval With Locality Information Using SMART
chapter
C. Buckley
G. Salton
J. Allan
National Institute of Standards and Technology
Donna K. Harman
The routing query was then run against each of the documents in the test set. Those documents
were indexed in the standard fashion, with each term receiving a tf.idf weight and cosine
normalization. Again, the idf document weight was determined by the occurrences of the term in
the learning set of documents only. Thus no collection information from the test set of documents
was used.
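The tf.idf weighting with cosine normalization described above can be sketched as follows. This is a minimal illustration, not the SMART code; the collection size and document-frequency figures are hypothetical numbers chosen for the example, and the idf is computed from the learning set only, as in the run.

```python
import math

def tfidf_cosine(doc_tf, idf):
    """Weight each term frequency by tf * idf, then cosine-normalize
    so the resulting weight vector has unit length."""
    weights = {t: tf * idf.get(t, 0.0) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights
    return {t: w / norm for t, w in weights.items()}

# idf taken from the learning set only: N learning documents,
# learning_df[t] = number of learning documents containing t.
# (Hypothetical numbers for illustration.)
N = 1000
learning_df = {"retrieval": 50, "smart": 10, "the": 900}
idf = {t: math.log(N / df) for t, df in learning_df.items()}

doc = {"retrieval": 3, "smart": 1, "the": 7}   # raw term frequencies
w = tfidf_cosine(doc, idf)
```

Because of the cosine normalization, the weight vector `w` always has unit Euclidean length, so long documents are not favored over short ones.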
It took 306 seconds to construct the full feedback query set (most of the time spent deciding
which terms should be added to each query). It took 1.9 hours to index D2, forming an inverted
file, and then 293 seconds to run the 50 reformulated queries against the inverted file.
Effectiveness of this simple method was reasonable but not spectacular. The 11-point average
over 50 queries was 0.1924.
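The 11-point average reported here is the interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0 for each query. A small sketch of the computation for a single ranked list; the example list and relevance judgments are hypothetical:

```python
def eleven_point_average(rel_flags, num_relevant):
    """Interpolated 11-point average precision for one ranked list.
    rel_flags[i] is True if the document at rank i+1 is relevant."""
    recalls, precisions = [], []
    hits = 0
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            recalls.append(hits / num_relevant)
            precisions.append(hits / rank)
    avg = 0.0
    for level in [i / 10 for i in range(11)]:
        # interpolated precision: best precision at recall >= level
        ps = [p for r, p in zip(recalls, precisions) if r >= level]
        avg += max(ps) if ps else 0.0
    return avg / 11

# relevant documents at ranks 1 and 3, out of 2 relevant total
score = eleven_point_average([True, False, True], num_relevant=2)
```

Per-query scores would then be averaged over the 50 queries to give the single figure quoted above.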
Tradeoff runs
This set of runs provides an examination of some of the tradeoffs (disk space, memory, time, and
effectiveness) encountered within a single information retrieval system. There are many decisions
that need to be made when designing a system. The goal in this set of runs is to explore the conse-
quences of some fundamental choices including stopwords, stemming, phrases, and term weighting.
Conceptually, the standard SMART indexing and retrieval algorithms are given below.
INDEXING
For each document/query text
1.1 Break the text into tokens
1.2 Determine if token is a common word (stopword) to be
discarded.
1.3 Stem all remaining tokens to their root forms.
1.4 Assign concept numbers to each root, forming a
``vector'' of concepts.
1.5 Weight each term in the vector.
1.6 Store vector in an inverted file
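Steps 1.1 through 1.6 can be sketched as follows. This is a simplified illustration, not the SMART implementation: the stopword list and suffix-stripping stemmer are toy stand-ins for the real ones, and raw term frequency stands in for the weighting of step 1.5.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "to", "and", "in"}   # toy stopword list

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

concept_ids = {}              # root form -> concept number (step 1.4)
inverted = defaultdict(list)  # concept number -> [(doc_id, weight), ...]

def index_document(doc_id, text):
    vector = defaultdict(int)
    for token in re.findall(r"[a-z]+", text.lower()):   # 1.1 tokenize
        if token in STOPWORDS:                          # 1.2 drop stopwords
            continue
        root = stem(token)                              # 1.3 stem
        cid = concept_ids.setdefault(root, len(concept_ids))
        vector[cid] += 1                                # raw tf for 1.5
    for cid, weight in vector.items():                  # 1.6 post to
        inverted[cid].append((doc_id, weight))          #     inverted file
    return dict(vector)

vec = index_document(1, "the stemming of words")
```

In a full system the inverted file would be written to disk rather than held in a dictionary, and step 1.5 would apply tf.idf weighting with cosine normalization as described earlier.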
RETRIEVAL
For each term in the query
2.1 Get the inverted list of documents containing that term.
2.2 For each document/weight on inverted list
2.3 Add Qi * Di to the partial similarity computed
for this term so far (Qi is this term's query
weight and Di the document weight)
2.4 Add document to current list of top documents
if similarity is high enough.
2.5 Return list of top documents to the user.
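Steps 2.1 through 2.4 amount to accumulating the Qi * Di partial similarities document by document over the inverted lists, one query term at a time. A minimal sketch, with the inverted file represented as an in-memory dictionary and all names illustrative:

```python
from collections import defaultdict

def retrieve(query_weights, inverted, top_k=10):
    """Score documents against a weighted query using inverted lists,
    returning the top_k (doc_id, similarity) pairs, best first."""
    sims = defaultdict(float)
    for term, q_w in query_weights.items():
        # 2.1 get the inverted list for this query term
        for doc_id, d_w in inverted.get(term, []):
            # 2.2-2.3 add Qi * Di to this document's partial similarity
            sims[doc_id] += q_w * d_w
    # 2.4-2.5 keep and return only the top-scoring documents
    return sorted(sims.items(), key=lambda pair: -pair[1])[:top_k]

# toy inverted file: term -> [(doc_id, document weight), ...]
index = {"a": [(1, 0.5), (2, 0.2)], "b": [(2, 0.4)]}
ranked = retrieve({"a": 1.0, "b": 1.0}, index)
```

A production system keeps only a bounded set of top candidates during accumulation (as step 2.4 suggests) instead of sorting all scored documents at the end.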