NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology
Donna K. Harman

14. Kwok, K.L. (1986). An interpretation of index term weighting schemes based on document components. Proc. 1986 ACM Conf. on R&D in IR. F. Rabitti, ed. ACM: NY, pp. 275-283.
15. Kwok, K.L. & Kuan, W. (1988). Experiments with document components for indexing and retrieval. Inform. Proc. Mgmnt. 24:405-417.
16. Salton, G. & Buckley, C. (1990). Improving retrieval performance by relevance feedback. J. of ASIS. 41:288-297.

Appendix

System Summary and Timing

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
   1. Stopword list: YES
      a. how many words in list? 595
   2. is a controlled vocabulary used? NO
   3. stemming: YES
      a. standard stemming algorithms (which ones?): PORTER'S ALGORITHM
      b. morphological analysis: NO
   4. term weighting: YES
   5. phrase discovery: NO
      a. what kind of phrase?
      b. using statistical methods
      c. using syntactic methods
   6. syntactic parsing: NO
   7. word sense disambiguation: NO
   8. heuristic associations: NO
      a. short definition of these associations
   9. spelling checking (with manual correction): NO
   10. spelling correction: NO
   11. proper noun identification algorithm: NO
   12. tokenizer (recognizes dates, phone numbers, common patterns): NO
      a. which patterns are tokenized?
   13. are the manually-indexed terms used? NO
   14. other techniques used to build data structures (brief description): A TABLE OF 3% MANUALLY CREATED 2-WORD PHRASES. WHEN THESE ARE IDENTIFIED IN ADJACENT POSITIONS IN DOCUMENTS OR QUERIES THEY ARE USED AS ADDITIONAL INDEX TERMS.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
   1. inverted index
      a. total amount of storage (megabytes): 372
      b. total computer time to build (approximate number of hours): 95+11+2=108 for 500MB. CLOCK TIME
      c. Is the process completely automatic? YES, IF SUFFICIENT DISK. NOT IN THIS EXPERIMENT.
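
The phrase-table technique reported under "other techniques used to build data structures" (two-word phrases from a manually created table, used as additional index terms when found in adjacent positions) can be illustrated with a minimal sketch. This is not the PIRCS implementation: the table entries, the function name index_terms, and the underscore-joined phrase format are all hypothetical, and the sketch assumes the caller supplies tokens in whatever form (stopped, stemmed, or raw) the phrase table was built for.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical phrase table; these entries are illustrative only and are
# not taken from the PIRCS table described in the summary.
PHRASE_TABLE: Set[Tuple[str, str]] = {
    ("information", "retrieval"),
    ("relevance", "feedback"),
}


def index_terms(
    tokens: Iterable[str],
    phrase_table: Set[Tuple[str, str]] = PHRASE_TABLE,
) -> List[str]:
    """Return the single-word terms plus one extra term for each
    table phrase whose two words occur in adjacent positions."""
    words = [t.lower() for t in tokens]
    terms = list(words)  # every word stays an index term in its own right
    # Walk over adjacent word pairs and look each pair up in the table.
    for first, second in zip(words, words[1:]):
        if (first, second) in phrase_table:
            # Assumed representation: the phrase joined by an underscore.
            terms.append(first + "_" + second)
    return terms
```

The same function would be applied to both documents and queries so that a phrase in a query can match the corresponding phrase term in a document, while the component words still match individually.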