NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology
Donna K. Harman

 1. SUBDOCUMENT FILE              2. CODED FILE
 3. DOCID CHECKING FILE           4. TERMID CHECKING FILE
 5. DOCNUM FILE                   6. TERMNUM (DICTIONARY) FILE
 7. DIRECT FILE                   8. INDEX TO DIRECT FILE
 9. NODE FILE                    10. EDGE FILE

a. total amount of storage (megabytes)
   1. 481   2. 324   3. 7   4. 4   5. 11   6. 6   7. 372   8. 19   9. 4x14   10. 4x9
   SYSTEM WAS DEVELOPED FOR EXPERIMENTAL RESEARCH, WITH FLEXIBILITY TO GENERATE OTHER DATA. SOME OF THE FILES ARE NOT NECESSARY FOR RETRIEVAL.

b. total computer time to build (approximate number of hours)
   1. 1.5   2,3,4,5,6. 95   7,8. 11   9,10. 4x0.25=1

c. is the process completely automatic?
   YES IF SUFFICIENT RAM AND DISK SPACE. FOR THIS EXPT, NO.
   if not, approximately how many hours of manual labor?
   2

d. brief description of methods used
   RAW TEXT --> SUBDOCUMENT FILE
   SUBDOCUMENT --> CODED FILE, DOCID FILE, TERMID FILE, DOCNUM FILE, TERMNUM (DICTIONARY) FILE.
   ZIPF-LAW PROGRAM TRUNCATES DICTIONARY VIA USER ASSIGNED LIMITS.
   CODED, TERMNUM --> DIRECT FILE with INDEX
   DIRECT --> INVERTED FILE
   DIRECT, INVERTED --> NODE, EDGE FILES.

C. Data built from sources other than the input text

1. internally-built auxiliary files
   a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)
      DOMAIN SPECIFIC
   b. type of file (thesaurus, knowledge base, lexicon, etc.)
      WORD PAIR
   c. total amount of storage (megabytes)
      0.005
   d. total number of concepts represented
      396
   e. type of representation (frames, semantic nets, rules, etc.)
   f. total computer time to build (approximate number of hours)
      (1) if already built, how much time to modify for TREC?
          0 (THIS IS A FILE CREATED VIA EDITOR).
   g. total manual time to build (approximate number of hours)
      16
      (1) if already built, how much time to modify for TREC? 167
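
The file-building chain listed under "brief description of methods used" (CODED and TERMNUM files giving the DIRECT file, then DIRECT giving the INVERTED file) can be illustrated with a minimal in-memory sketch. It is not the PIRCS implementation. The function names, the list/dict data layout, and the reading of the "user assigned limits" as a lower and upper bound on collection frequency are all assumptions made for illustration, and the NODE and EDGE files are not modeled.

```python
from collections import Counter, defaultdict


def build_dictionary(coded_docs, low, high):
    """Assumed stand-in for dictionary truncation: keep only terms whose
    collection frequency lies within the user-assigned limits [low, high],
    then assign consecutive term numbers in sorted order."""
    freq = Counter(term for doc in coded_docs for term in doc)
    kept = sorted(t for t, f in freq.items() if low <= f <= high)
    return {term: num for num, term in enumerate(kept)}


def build_direct(coded_docs, termnum):
    """Direct file: one entry per document number, mapping
    term number -> within-document count (dropped terms are skipped)."""
    return [dict(Counter(termnum[t] for t in doc if t in termnum))
            for doc in coded_docs]


def invert(direct):
    """Inverted file: for each term number, the list of
    (document number, count) postings in document order."""
    inverted = defaultdict(list)
    for docnum, counts in enumerate(direct):
        for tnum, count in sorted(counts.items()):
            inverted[tnum].append((docnum, count))
    return dict(inverted)


# Toy "coded" subdocuments (hypothetical data, already tokenized).
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "ran"],
    ["the", "dog"],
]

# "the" (frequency 4) and "sat" (frequency 1) fall outside the limits.
termnum = build_dictionary(docs, low=2, high=3)
direct = build_direct(docs, termnum)
inverted = invert(direct)

print(termnum)
print(direct)
print(inverted)
```

In this toy run the limits remove both the most frequent term and the singleton, which is the usual motivation for frequency-based truncation. The real system writes these structures to disk files with a separate index, which the sketch does not attempt.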