SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
algorithm requires that at least two original text words agree on a synonym before
it is added to the vector.
The effect of this is to do a poor man's version of sense disambiguation for the
synonyms.
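The agreement rule described above can be illustrated with a minimal sketch. It is not the system's actual code: the function name, the `min_support` parameter, and the toy synonym table are hypothetical, and the sketch assumes that "two original text words" means two distinct words.

```python
from collections import defaultdict


def agreed_synonyms(text_words, synonyms_of, min_support=2):
    """Return the synonyms proposed by at least `min_support` distinct text words.

    `synonyms_of` maps a word to the synonyms a thesaurus lists for it.
    (Assumption: a plain dict stands in for the WordNet lookup.)
    """
    supporters = defaultdict(set)
    for word in set(text_words):
        for syn in synonyms_of.get(word, ()):
            if syn != word:
                supporters[syn].add(word)
    # Keep a synonym only when enough different text words agree on it.
    return {syn for syn, words in supporters.items() if len(words) >= min_support}


# Hypothetical toy synonym table (not real WordNet data).
TOY_SYNONYMS = {
    "bank": ["depository", "riverside", "financial_institution"],
    "thrift": ["financial_institution", "frugality"],
    "shore": ["riverside", "coast"],
}

added = agreed_synonyms(["bank", "thrift", "loan"], TOY_SYNONYMS)
```

In this toy case only `financial_institution` is added, because both `bank` and `thrift` propose it; `riverside` is proposed by `bank` alone and is dropped, which is the rough sense-filtering effect the text describes.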
8. heuristic associations
a. short definition of these associations
WordNet synonymy relation only association used.
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) no
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 947 megabytes of disk storage
b. total computer time to build (approximate number of hours)
5 hours to build index given document vectors; document vectors took 37
hours to build from text. Thus, approximately 42 hours to go from text to
inverted index.
c. is the process completely automatic? yes
d. are term positions within documents stored?
No term position information maintained.
e. single terms only?
Single terms only (although, as stated above, a single term from WordNet
may be a collocation such as `electrical_discharge').
2. n-grams, suffix arrays, signature files
N-grams and signature files not used. SMART stemmer algorithm incorporates a
(static) trie of suffixes.
3. knowledge bases
No knowledge base used other than WordNet (described under I.C.2).
C. Data built from sources other than the input text
1. internally-built auxiliary files none
2. externally-built auxiliary file
a. type of file (Treebank, WordNet, etc.) WordNet (noun portion only)
b. total amount of storage (megabytes) 5 megabytes
c. total number of concepts represented 35,155 synonym sets (67,293 word senses)
d. type of representation (frames, semantic nets, rules, etc.)
We used only the synonymy relation that WordNet contains. However,
WordNet contains many other lexical relationships making it similar to a
semantic net.
II. Query construction
(please fill out a section for each query construction method used)
[We submitted one set of results; those results were for automatically built ad hoc queries.]
A. Automatically built queries (ad hoc)
1. topic fields used
Concepts (<con>), Description (<desc>), Factors (<fac>), Narrative (<narr>),
Nationality (<nat>), Title (<title>)
2. total computer time to build query (cpu seconds)
1 second, on average (50 seconds to index 50 queries)
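As a rough illustration of building a query from the topic fields listed under II.A.1, the sketch below pulls the text of those fields out of a tagged topic. It is an assumption-laden example, not the system's indexing code: the function name, the regular expressions, and the sample topic (including its simplified layout and field labels) are all hypothetical.

```python
import re

# Tags of the topic fields named in II.A.1.
FIELDS_USED = ("con", "desc", "fac", "narr", "nat", "title")

TAG = re.compile(r"<(/?)(\w+)>")
LABEL = re.compile(r"^[A-Za-z()]+:\s*")  # leading label such as "Description:"


def extract_fields(topic_text, fields=FIELDS_USED):
    """Collect the text that follows each wanted tag, up to the next tag."""
    out = {}
    matches = list(TAG.finditer(topic_text))
    for i, m in enumerate(matches):
        closing, name = m.group(1), m.group(2)
        if closing or name not in fields:
            continue
        end = matches[i + 1].start() if i + 1 < len(matches) else len(topic_text)
        text = " ".join(topic_text[m.end():end].split())  # normalize whitespace
        text = LABEL.sub("", text)                        # drop the field label
        if text:
            out[name] = text
    return out


# Hypothetical, simplified topic; not an actual TREC topic.
SAMPLE_TOPIC = """<top>
<num> Number: 000
<title> Topic: Example Topic
<desc> Description:
Document discusses an example subject.
<narr> Narrative:
A relevant document mentions the example subject.
<con> Concept(s):
1. example, sample
<fac> Factor(s):
<nat> Nationality: U.S.
</fac>
</top>"""

fields = extract_fields(SAMPLE_TOPIC)
query_text = " ".join(fields.values())
```

The resulting `query_text` would then be passed through the same indexing steps as a document; that step is not shown here.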