SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
chapter
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
14. Kwok, K.L (1986). An interpretation of index term weighting schemes based on document
components. Proc. 1986 ACM Conf. on R&D in IR. F. Rabitti, ed. ACM: NY, pp.275-283.
15. Kwok, K.L & Kuan, W (1988). Experiments with document components for indexing and retrieval.
Inform. Proc. Mgmnt. 24:405-417.
16. Salton, G & Buckley, C (1990). Improving retrieval performance by relevance feedback. J. of ASIS.
41:288-297.
Appendix
System Summary and Timing
I. Construction of indices, knowledge bases, and other data structures
(please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures
1. Stopword List YES
a. how many words in list? 595
2. is a controlled vocabulary used? NO
3. stemming
a. standard stemming algorithms YES
which ones? PORTER'S ALGORITHM
b. morphological analysis NO
4. term weighting YES
5. phrase discovery NO
a. what kind of phrase?
b. using statistical methods
c. using syntactic methods
6. syntactic parsing NO
7. word sense disambiguation NO
8. heuristic associations NO
a. short definition of these associations
9. spelling checking (with manual correction) NO
10. spelling correction NO
11. proper noun identification algorithm NO
12. tokenizer (recognizes dates, phone numbers, common patterns) NO
a. which patterns are tokenized?
13. are the manually-indexed terms used? NO
14. other techniques used to build data structures (brief description)
A TABLE OF 3% MANUALLY CREATED 2-WORD PHRASES. WHEN THESE ARE IDENTIFIED
IN ADJACENT POSITIONS IN DOCUMENTS OR QUERIES THEY ARE USED AS ADDITIONAL
INDEX TERMS.
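The phrase handling described in item 14 can be illustrated with a minimal Python sketch. It is not the PIRCS implementation. The phrase table entries, the function name `index_terms`, and the underscore-joined form of the phrase term are all hypothetical, and the sketch assumes matching is done on already tokenized text, since the source does not say whether matching happens before or after stopword removal and stemming.

```python
# Hypothetical two-word phrase table; the real table's entries are not given in the source.
PHRASE_TABLE = {
    ("information", "retrieval"),
    ("relevance", "feedback"),
}


def index_terms(tokens, phrase_table=PHRASE_TABLE):
    """Return the single-word terms plus one extra term per table phrase
    whose two words occur in adjacent positions in `tokens`."""
    tokens = list(tokens)
    terms = list(tokens)  # single words remain index terms
    for left, right in zip(tokens, tokens[1:]):
        if (left, right) in phrase_table:
            # Assumed representation: join the pair into one additional term.
            terms.append(f"{left}_{right}")
    return terms


if __name__ == "__main__":
    doc = ["relevance", "feedback", "improves", "information", "retrieval"]
    print(index_terms(doc))
```

The same function would apply to documents and to queries, so a phrase contributes an additional matching term only when it appears adjacently on both sides.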
B. Statistics on data structures built from TREC text
(please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 372
b. total computer time to build (approximate number of hours)
95+11+2=108 for 500MB. CLOCK TIME
c. Is the process completely automatic? YES, IF SUFFICIENT DISK. NOT IN THIS EXPERIMENT.