NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

approximately 5 minutes per topic. All additional steps were performed automatically.

    d. brief description of methods used
       (See above.)

5. other data structures built from TREC text (what?)
   Each TREC document had to be formatted for CLARIT processing, by making the unique text ID accessible to CLARIT as a special field and by delimiting the beginning and end of each text in a file. Intermediate (but unretained) files generated in CLARIT processing include a file of the words in each text, in their original order, annotated with morphological categories. Other files contain the output of the parser, as a list of NPs in the order in which they occurred in each text. The parsed representation of the text was retained and used at all subsequent steps of processing.

    a. total amount of storage (megabytes)
       Processing steps are piped through the system; intermediate files are not retained. The parsed representation of all the texts takes up approximately 98% of the space occupied by the original text.

    b. total computer time to build (approximate number of hours)
       The total time to transform the original 2-gigabytes of text into parsed text takes about 10 real hours, with processing distributed over 5 machines.

    c. is the process completely automatic?
       Yes

    d. brief description of methods used
       A `lex' program was used to reformat the TREC text to CLARIT format. The English morphological analyzer is written in C, and utilizes the lexicon of 97,000 items (mentioned above and further described below). The noun phrase parser, also written in C, uses the grammatical categories supplied by the morphological analysis and an ATN-style rule set to extract noun phrases.

C. Data built from sources other than the input text

1. internally-built auxiliary files

    a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)
       Domain independent

    b. type of file (thesaurus, knowledge base, lexicon, etc.)

    c. total amount of storage (megabytes)
       CLARIT Lexicon (2 megabytes)
       English-word statistics derived from the Grolier's Encyclopedia (2 megabytes)

    d. total number of concepts represented
       97,000 words (CLARIT Lexicon)
       139,000 words (Grolier's list)

    e. type of representation (frames, semantic nets, rules, etc.)
       Lexicon: A sorted word list, giving for each word its possible grammatical categories and category-dependent normalization.
       Grolier's: A list of words with distribution and frequency counts

    f. total computer time to build (approximate number of hours)
       (1) if already built, how much time to modify for TREC?
           Already built--Not modified for TREC

    g. total manual time to build (approximate number of hours)
       (1) if already built, how much time to modify for TREC?
           Already built--Not modified for TREC

    h. use of manual labor
       (1) mostly manually built using special interface
       (2) mostly machine built with manual correction
       (3) initial core manually built to "bootstrap" for completely machine-built