SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
2.9 hours to reweight doc vectors and produce inverted file
c. is the process completely automatic? yes
d. are term positions within documents stored? no
e. single terms only? no
5. other data structures built from TREC text (what?)
Map from docid to text location (also gives title for each doc)
a. total amount of storage (megabytes) 68 Mbytes.
b. total computer time to build (approximate number of hours)
Time to create included in inverted file creation above.
c. is the process completely automatic? yes
other data structures built from TREC text (what?)
Map from internal concept to token string
a. total amount of storage (megabytes) 25 Mbytes
b. total computer time to build (approximate number of hours)
Time to create included in inverted file creation above.
c. is the process completely automatic? yes
other data structures built from TREC text (what?)
Phrase dictionary (controlled vocabulary)
Phrases were adjacent non-stopwords, components stemmed, that occurred at least
25 times in the D1 document set.
a. total amount of storage (megabytes) 14 Mbytes to store dictionary.
b. total computer time to build (approximate number of hours)
It took 5.8 hours to index D1, finding [OCRerr] phrases and their collection
stats. Of those phrases 158,000 occurred at least 25 times.
c. is the process completely automatic?
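The phrase dictionary described above (adjacent non-stopwords, components stemmed, kept only if they occur at least 25 times) can be illustrated with a minimal Python sketch. It is not the system's actual code. The stopword list, the toy `stem` function, and the names `adjacent_phrases` and `build_phrase_dictionary` are hypothetical, and the sketch assumes that "adjacent" means adjacent in the original text and that the threshold applies to total occurrences in the collection.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; the real system used its own stopword file.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}

def stem(token):
    # Placeholder stemmer (assumption): strips one trailing "s" from longer words.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def adjacent_phrases(text):
    """Return stemmed pairs of words that are adjacent in the text
    and are both non-stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    phrases = []
    for left, right in zip(tokens, tokens[1:]):
        if left in STOPWORDS or right in STOPWORDS:
            continue  # a stopword on either side breaks the phrase
        phrases.append((stem(left), stem(right)))
    return phrases

def build_phrase_dictionary(documents, min_count=25):
    """Count every candidate phrase over the collection and keep those
    occurring at least min_count times (25 in the description above)."""
    counts = Counter()
    for text in documents:
        counts.update(adjacent_phrases(text))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}
```

Counting all candidates first and filtering afterwards mirrors the two figures reported in the form: the number of phrases found during indexing and the smaller number that met the threshold.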
C. Data built from sources other than the input text
None, other than stopword file.
II. Query construction
(please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used Topic, Nationality, Narrative, Concepts, Factors, Description
2. total computer time to build query (cpu seconds) 2.7 seconds
3. which of the following were used?
a. term weighting with weights based on terms in topics (idf)
b. phrase extraction from topics yes, using controlled list of phrases
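As an illustration of the idf-based term weighting named in item 3a, the following Python sketch weights each topic term by its frequency in the topic times an inverse document frequency. The form states only "idf", so the formula log(N / df) is an assumption, and the names `idf_weights`, `doc_freq`, and `num_docs` are hypothetical.

```python
import math
from collections import Counter

def idf_weights(query_terms, doc_freq, num_docs):
    """Weight each query term by (frequency in the topic) * log(N / df).

    doc_freq maps a term to the number of documents containing it;
    num_docs is the collection size N. Both are assumed inputs.
    """
    weights = {}
    for term, count in Counter(query_terms).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term absent from the collection: it cannot match any document
        weights[term] = count * math.log(num_docs / df)
    return weights
```

With this weighting a term that appears in few documents contributes more to the query than a term that appears in many, which is the usual motivation for idf.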
III. Searching
A. Total computer time to search (cpu seconds)
374 seconds (includes retrieval + ranking).
1. retrieval time (total cpu seconds between when a query enters the system until a list of
document numbers are obtained)
2. ranking time (total cpu seconds to sort document list)
B. Which methods best describe your machine searching methods?
1. vector space model
2. probabilistic model
C. What factors are included in your ranking?