SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
d. brief description of methods used
count word and non-word frequencies using splay tree
other data structures built from TREC text (what?)
a single file of the text itself, compressed
a. total amount of storage (megabytes) 253.2 Mb
b. total computer time to build (approximate number of hours) 3.10 cpu hours
c. is the process completely automatic? yes
d. brief description of methods used
zero-order word-based model using Huffman coding
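The idea of a zero-order word-based Huffman model can be illustrated with a minimal sketch. This is not the system's implementation: the function names (tokenize, build_huffman_codes, decode) are hypothetical, a single code table is assumed for words and non-words together, and the output is a string of "0"/"1" characters rather than packed bits.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    # Split into alternating runs of alphanumerics ("words") and
    # everything else ("non-words"), so the text can be rebuilt exactly.
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)


def build_huffman_codes(freqs):
    """Return {token: bitstring} from a mapping of token frequencies."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):  # internal node: (left, right)
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:  # leaf: a token string
            codes[node] = prefix
    return codes


def decode(bits, codes):
    # Huffman codes are prefix-free, so greedy matching is unambiguous.
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "the cat sat on the mat the end"
tokens = tokenize(text)
codes = build_huffman_codes(Counter(tokens))  # zero-order: plain token counts
bits = "".join(codes[t] for t in tokens)
restored = decode(bits, codes)
```

Frequent tokens such as the space and "the" receive the shortest codes, which is where the compression comes from.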
other data structures built from TREC text (what?)
a file of document addresses and document lengths (for cosine)
a. total amount of storage (megabytes) 1.8 Mb
b. total computer time to build (approximate number of hours) negligible
other data structures built from TREC text (what?)
vocabulary for inverted index
a. total amount of storage (megabytes) 3.6 Mb
b. total computer time to build (approximate number of hours) 2.41 cpu hours
c. is the process completely automatic? yes
d. brief description of methods used
count stemmed word frequencies using splay tree
other data structures built from TREC text (what?)
a file of inverted index entry addresses
a. total amount of storage (megabytes) 1.2 Mb
b. total computer time to build (approximate number of hours) negligible
other data structures built from TREC text (what?)
a file of approximate document lengths
a. total amount of storage (megabytes) 0.2 Mb
b. total computer time to build (approximate number of hours) negligible
C. Data built from sources other than the input text -- no
II. Query construction
(please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used all
2. total computer time to build query (cpu seconds) less than one second
3. which of the following were used?
a. term weighting with weights based on terms in topics yes, as in cosine measure
j. other (describe)
used stop words to eliminate common words from query
eliminated SGML tags and all punctuation
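The query construction steps listed above (strip SGML tags and punctuation, drop stop words, weight the remaining terms) can be sketched as follows. This is an illustration under assumptions: the stop list is a small hypothetical sample, the function name build_query is invented here, the term weight is taken to be the raw within-topic frequency, and stemming is omitted.

```python
import re
from collections import Counter

# Hypothetical sample stop list; the system's actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
              "for", "with", "is", "are", "will", "be"}


def build_query(topic_text, stop_words=STOP_WORDS):
    """Return {term: weight} built from every field of a topic."""
    # Remove SGML tags such as <title> or </top>.
    no_tags = re.sub(r"<[^>]*>", " ", topic_text)
    # Keeping only alphanumeric runs discards all punctuation.
    words = re.findall(r"[a-z0-9]+", no_tags.lower())
    # Weight each surviving term by its frequency in the topic.
    return Counter(w for w in words if w not in stop_words)


topic = ("<top> <title> Topic: Coping with overcrowded prisons "
         "<desc> The document will provide information on prisons. </top>")
query = build_query(topic)
```

Because all topic fields are used, a term repeated across fields (here "prisons") ends up with a higher weight than a term that appears once.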
III. Searching
A. Total computer time to search (cpu seconds)
1 & 2 were not timed separately; 35 seconds per query to identify the top 200 ranked items
a further 4.6 seconds of cpu to decompress the top 200 items, 18.6 seconds in total including
retrieval time
1. retrieval time (total cpu seconds between when a query enters the system until a list of
document numbers is obtained)
2. ranking time (total cpu seconds to sort document list)
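Retrieval and ranking with the cosine measure, using an inverted index and a file of document lengths, can be sketched as below. This is a hedged illustration, not the system's code: the function name rank, the in-memory dictionaries, and the example weights are all assumptions. The query norm is left out because it is the same for every document and does not change the ordering.

```python
import heapq
import math
from collections import defaultdict


def rank(query, index, doc_lengths, k=200):
    """Return the top k (doc_id, score) pairs by cosine similarity.

    query:       {term: query weight}
    index:       {term: [(doc_id, document weight), ...]}  (inverted index)
    doc_lengths: {doc_id: vector length of the document}
    """
    accumulators = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, ()):
            accumulators[doc_id] += q_weight * d_weight
    # Normalise each accumulated dot product by the document length.
    scored = ((score / doc_lengths[doc_id], doc_id)
              for doc_id, score in accumulators.items())
    # A heap selects the k best without sorting every candidate.
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical toy data for illustration only.
index = {"prison": [(1, 2.0), (2, 1.0)], "crowd": [(2, 3.0)]}
doc_lengths = {1: 2.0, 2: math.sqrt(10.0)}
query = {"prison": 1.0, "crowd": 1.0}
ranked = rank(query, index, doc_lengths)
```

Selecting the top items with a heap is one reason retrieval and ranking may not be separable into two timed phases, since scoring and selection happen in a single pass over the accumulators.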