SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
chapter
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
1. SUBDOCUMENT FILE
2. CODED FILE
3. DOCID CHECKING FILE
4. TERMID CHECKING FILE
5. DOCNUM FILE
6. TERMNUM (DICTIONARY) FILE
7. DIRECT FILE
8. INDEX TO DIRECT FILE
9. NODE FILE
10. EDGE FILE
a. total amount of storage (megabytes)
1. 481
2. 324
3. 7
4. 4
5. 11
6. 6
7. 372
8. 19
9. 4x14
10. 4x9
SYSTEM WAS DEVELOPED FOR EXPERIMENTAL RESEARCH, WITH FLEXIBILITY TO
GENERATE OTHER DATA. SOME OF THE FILES ARE NOT NECESSARY FOR RETRIEVAL.
b. total computer time to build (approximate number of hours)
1. 1.5
2,3,4,5,6. 95
7,8. 11
9,10. 4x0.25=1
c. is the process completely automatic?
YES IF SUFFICIENT RAM AND DISK SPACE. FOR THIS EXPT, NO.
if not, approximately how many hours of manual labor?
2
d. brief description of methods used
RAW TEXT --> SUBDOCUMENT FILE
SUBDOCUMENT --> CODED FILE, DOCID FILE, TERMID FILE,
DOCNUM FILE, TERMNUM (DICTIONARY) FILE.
ZIPF-LAW PROGRAM TRUNCATES DICTIONARY VIA USER ASSIGNED LIMITS.
CODED, TERMNUM --> DIRECT FILE with INDEX
DIRECT --> INVERTED FILE
DIRECT, INVERTED --> NODE, EDGE FILES.
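The file flow listed above can be illustrated with a small sketch. It is not the PIRCS code: the function names, the in-memory dictionaries standing in for files, and the frequency limits `low` and `high` are all assumptions made for illustration. The sketch shows only three of the listed steps: truncating the dictionary by frequency limits, building a direct file (document to terms), and inverting it (term to documents). Subdocument splitting and the node and edge files are not shown.

```python
from collections import Counter, defaultdict


def build_dictionary(docs, low, high):
    """Number the terms whose collection frequency lies in [low, high].

    `low` and `high` stand in for the user-assigned limits (hypothetical names).
    """
    freq = Counter(term for terms in docs.values() for term in terms)
    kept = sorted(term for term, f in freq.items() if low <= f <= high)
    return {term: num for num, term in enumerate(kept)}


def build_direct(docs, termnum):
    """Direct file: document number -> {term number: within-document frequency}."""
    direct = {}
    for docnum, docid in enumerate(sorted(docs)):
        counts = Counter(termnum[t] for t in docs[docid] if t in termnum)
        direct[docnum] = dict(counts)
    return direct


def invert(direct):
    """Inverted file: term number -> {document number: frequency}."""
    inverted = defaultdict(dict)
    for docnum, counts in direct.items():
        for term, tf in counts.items():
            inverted[term][docnum] = tf
    return dict(inverted)


# Toy collection (invented data, for illustration only).
docs = {
    "d1": ["retrieval", "network", "the", "the"],
    "d2": ["retrieval", "the", "probabilistic"],
    "d3": ["network", "the", "retrieval"],
}
termnum = build_dictionary(docs, low=2, high=3)  # drops "the" and "probabilistic"
direct = build_direct(docs, termnum)
inverted = invert(direct)
```

In this toy run the very frequent term and the singleton term fall outside the limits, so only two terms receive numbers, and the direct and inverted structures hold the same postings indexed from opposite sides.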
C. Data built from sources other than the input text
1. internally-built auxiliary files
a. domain independent or domain specific (if two separate
files, please fill out one set of questions for each file)
DOMAIN SPECIFIC
b. type of file (thesaurus, knowledge base, lexicon, etc.)
WORD PAIR
c. total amount of storage (megabytes)
0.005
d. total number of concepts represented
396
e. type of representation (frames, semantic nets, rules, etc.)
f. total computer time to build (approximate number of hours)
(1) if already built, how much time to modify for TREC?
0
(THIS IS A FILE CREATED VIA EDITOR).
g. total manual time to build (approximate number of hours)
16
(1) if already built, how much time to modify for TREC?
167