SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
VPI & SU
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list  yes
   a. how many words in list?  41
2. is a controlled vocabulary used?  no
3. stemming
4. term weighting
   Vector and p-norm runs were done with no term weights. Vector runs were also
   performed with aug-norm * idf weighting.
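The "aug-norm * idf" weighting mentioned above is the standard augmented, max-normalized term frequency multiplied by inverse document frequency. A minimal sketch follows; the function name and the exact idf base are illustrative assumptions, not SMART internals:

```python
# Hedged sketch of aug-norm * idf weighting: augmented tf normalized by the
# maximum tf in the document, times idf. Names here are illustrative only.
import math

def aug_norm_idf(tf, max_tf, df, n_docs):
    """Augmented normalized tf (0.5 + 0.5 * tf/max_tf) times idf (natural log assumed)."""
    aug_tf = 0.5 + 0.5 * (tf / max_tf)
    idf = math.log(n_docs / df)
    return aug_tf * idf

# A term occurring 3 times (max tf in the document is 6),
# appearing in 10 of 1000 documents:
w = aug_norm_idf(3, 6, 10, 1000)
```

The augmented form keeps weights in a bounded range (0.5 to 1.0 before the idf factor), so long documents do not dominate purely by repeating terms.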
5. phrase discovery no
6. syntactic parsing  no
7. word sense disambiguation  no
8. heuristic associations  no
9. spelling checking (with manual correction)  no
10. spelling correction  no
11. proper noun identification algorithm  As provided in SMART
12. tokenizer (recognizes dates, phone numbers, common patterns)  As provided in SMART
13. are the manually-indexed terms used?  not used as suggested in guidelines
14. other techniques used to build data structures (brief description)
    1983 version of SMART, enhanced with VPI&SU routines
B. Statistics on data structures built from TREC text (please fill out each applicable section)
   Except if you want us to answer under 4 here re the knowledge base used to help build our
   Boolean queries, please advise.
5. other data structures built from TREC text (what?)  Document vector file and term dictionary
   a. total amount of storage (megabytes)
      Approx. 15 MB for the dictionary and 121 MB for the document vector file
      for the entire Wall Street Journal collection.
   b. total computer time to build (approximate number of hours)
      Approx. time to build above: 10 hours on ccrdl (DECstation 5000 Model 25,
      i.e., a MIPS R3000 chip running at 25 MHz)
   c. is the process completely automatic?  yes
   d. brief description of methods used
      The document text is tokenized, stop words are thrown out, and non-noise
      words are kept in the term dictionary along with their occurrence frequencies.
      Each term in the dictionary has a unique identification number. The vector
      file contains, for each document, its unique ID and a vector of term IDs and
      weights. The weighting scheme is flexible and can be changed
      to one of several schemes after the indexing is complete. (If necessary we
      can fill in details here. Please advise.)
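The indexing method described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SMART implementation: the stopword list, tokenization by whitespace, and all names are hypothetical, and raw term frequencies stand in for the deferred weighting step.

```python
# Hedged sketch: tokenize, drop stop words, assign each surviving term a
# unique ID in a term dictionary, and build per-document vectors of
# (term ID, frequency) pairs. Weights can be recomputed from these later.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative list

term_ids = {}          # term dictionary: word -> unique identification number
doc_freq = Counter()   # number of documents each term occurs in

def index_document(doc_id, text):
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    counts = Counter(tokens)
    vector = []
    for term, tf in counts.items():
        tid = term_ids.setdefault(term, len(term_ids))
        doc_freq[term] += 1
        vector.append((tid, tf))  # raw tf stored; reweight after indexing
    return doc_id, sorted(vector)

doc = index_document("WSJ870101-0001", "the price of gold and the price of oil")
```

Keeping raw frequencies in the vector file and applying the weighting scheme afterward is what makes it possible, as the answer notes, to switch among several weighting schemes after indexing is complete.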