SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
approximately 5 minutes per topic. All additional steps were performed
automatically.
d. brief description of methods used (See above.)
5. other data structures built from TREC text (what?)
Each TREC document had to be formatted for CLARIT processing, by making the
unique text ID accessible to CLARIT as a special field and by delimiting the
beginning and end of each text in a file. Intermediate (but unretained) files
generated in CLARIT processing include a file of the words in each text, in their
original order, annotated with morphological categories. Other files contain the
output of the parser, as a list of NPs in the order in which they occurred in each
text. The parsed representation of the text was retained and used at all subsequent
steps of processing.
a. total amount of storage (megabytes)
Processing steps are piped through the system; intermediate files are not
retained. The parsed representation of all the texts takes up approximately
98% of the space occupied by the original text.
b. total computer time to build (approximate number of hours)
Transforming the original 2 gigabytes of text into parsed text
takes about 10 real hours, with processing distributed over 5 machines.
c. is the process completely automatic? Yes
d. brief description of methods used
A `lex' program was used to reformat the TREC text to CLARIT format.
The English morphological analyzer is written in C, and utilizes the lexicon
of 97,000 items (mentioned above and further described below).
The noun phrase parser, also written in C, uses the grammatical categories
supplied by the morphological analysis and an ATN-style rule set to extract
noun phrases.
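The flow described above (lexicon lookup supplying grammatical categories, then rule-driven noun phrase extraction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon entries, the category names DET, ADJ, N and V, and the single flat pattern "optional determiner, any adjectives, one or more nouns". It is written in Python rather than C, and a flat pattern of this kind is far simpler than an ATN-style rule set; it is not the CLARIT lexicon, analyzer, or parser.

```python
# Hypothetical toy lexicon: word -> possible grammatical categories.
# These entries are invented for illustration, not taken from CLARIT.
TOY_LEXICON = {
    "the": ["DET"],
    "noun": ["N"],
    "phrase": ["N"],
    "parser": ["N"],
    "uses": ["V"],
    "grammatical": ["ADJ"],
    "categories": ["N"],
}


def tag(words):
    """Pair each word with its first listed category (unknown words -> N)."""
    return [(w, TOY_LEXICON.get(w.lower(), ["N"])[0]) for w in words]


def extract_noun_phrases(tagged):
    """Collect maximal spans matching DET? ADJ* N+, in text order."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "N":    # one or more nouns
            k += 1
        if k > j:                          # at least one noun was found
            phrases.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:                              # no noun phrase starts here
            i += 1
    return phrases


if __name__ == "__main__":
    sentence = "The noun phrase parser uses the grammatical categories".split()
    for np in extract_noun_phrases(tag(sentence)):
        print(np)
```

The sketch keeps the phrases in the order in which they occur in the text, matching the description of the parser output given under item 5.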
C. Data built from sources other than the input text
1. internally-built auxiliary files
a. domain independent or domain specific (if two separate files, please fill out one set
of questions for each file) Domain independent
b. type of file (thesaurus, knowledge base, lexicon, etc.)
c. total amount of storage (megabytes)
CLARIT Lexicon (2 megabytes)
English-word statistics derived from the Grolier's Encyclopedia (2
megabytes)
d. total number of concepts represented
97,000 words (CLARIT Lexicon)
139,000 words (Grolier's list)
e. type of representation (frames, semantic nets, rules, etc.)
Lexicon: A sorted word list, giving for each word its possible grammatical
categories and category-dependent normalization.
Grolier's: A list of words with distribution and frequency counts
f. total computer time to build (approximate number of hours)
(1) if already built, how much time to modify for TREC?
Already built--Not modified for TREC
g. total manual time to build (approximate number of hours)
(1) if already built, how much time to modify for TREC?
Already built--Not modified for TREC
h. use of manual labor
(1) mostly manually built using special interface
(2) mostly machine built with manual correction
(3) initial core manually built to "bootstrap" for completely machine-built