SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology
Donna K. Harman

Retrieval Experiments with a Large Collection using PIRCS

K.L. Kwok, L. Papadopoulos and Kathy Y.Y. Kwan
email: kklqc@cunyvm.bitnet
Computer Science Department
Queens College, CUNY, Flushing, NY 11367

ABSTRACT

Our approach to information retrieval and to the TREC experiments is based on techniques that have previously been demonstrated to work for small to medium-size collections: 1) use of document components for retrieval and term weighting; 2) two-word phrases to achieve better precision and recall; 3) combination of retrieval methods; and 4) a network implementation with learning capability to support feedback and query expansion. Evaluation shows that we return the best results for Category B in both ad hoc and routing retrievals, and that our approach is comparable to the best methods used in the Category A experiments. It appears that techniques that work for small collections, such as combining soft-boolean retrieval with a probabilistic model, user relevance feedback, and feedback with query expansion, also work for this large collection.

1. Introduction

Over the past years, we have built an experimental system for automated information retrieval (IR) called PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System. It provides storage and retrieval capability based on collection statistics, using content terms as index terms for document representation. Its design is based on a network and favors flexibility over efficiency or performance, so that future and unforeseen approaches to IR may be supported. As it turns out, when the TREC experiments were announced, we found that our system fit nicely with the requirements.
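The abstract mentions combining soft-boolean retrieval with a probabilistic model. As an illustration only, the following is a minimal sketch of one common way to fuse two ranking methods: normalize each method's document scores and take a weighted sum. The function name, the min-max normalization, and the `alpha` mixing parameter are assumptions for this sketch, not the specific PIRCS combination formula.

```python
def combine_scores(scores_a, scores_b, alpha=0.5):
    """Linearly combine two {doc_id: score} maps after min-max
    normalization, so neither method dominates by scale alone."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    a, b = normalize(scores_a), normalize(scores_b)
    docs = set(a) | set(b)
    # Documents unscored by one method contribute 0 from that method.
    return {d: alpha * a.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
            for d in docs}

# Hypothetical scores from a probabilistic and a soft-boolean ranker.
prob = {"d1": 2.1, "d2": 0.4, "d3": 1.3}
boolean = {"d1": 0.6, "d2": 0.9, "d4": 0.2}
fused = combine_scores(prob, boolean, alpha=0.6)
ranked = sorted(fused, key=fused.get, reverse=True)  # d1 ranks first
```

A document scoring well under both methods (here `d1`) rises to the top, which is the intuition behind combining retrieval methods.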
Certain changes (mostly because of large file sizes) and a number of peripheral programs were needed, but the basic design remains intact. Because available RAM and disk space are limited, we could only handle the Category B Wall Street Journal files (WSJ, 500 MByte raw data); however, we foresee no problem running the system on the Category A files (2 GByte raw data) given the appropriate hardware. In what follows, we describe our strategies for IR and TREC in Section 2, the system in Section 3 and the Appendix, results and discussion in Section 4, and conclusions in Section 5.

2. Strategies for IR and TREC Experiments

Over the past twenty-five or so years, a number of techniques have become known to work in IR for small to medium collections. These include: manually created thesauri and phrases to enhance recall and precision; term weighting that can account for the importance and discrimination power of content terms; user relevance feedback and feedback with query expansion; combining multiple retrieval methods; and using multiple document and query representations. Many other methods exist and have been experimented with (such as clustering, natural language processing, and various artificial intelligence techniques), but their effectiveness in general is still in question. Our approach to IR and to the TREC experiments is based on the following, which build on these proven techniques: 1) use of document components; 2) two-word phrases; 3) combination of retrieval methods; and 4) a network implementation with learning. We shall discuss how and why each of these methodologies may contribute to the effectiveness of our results.
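Two of the techniques listed above, discriminating term weighting and relevance feedback with query expansion, can be illustrated with standard textbook formulations. The sketch below uses inverse document frequency for term weighting and a simplified Rocchio-style expansion step; the function names and the `top_k`/`beta` parameters are assumptions for this example, and PIRCS's own component-based probabilistic weights differ in detail.

```python
import math
from collections import Counter

def idf_weights(docs):
    """docs: list of term lists. Returns {term: idf}, favoring
    discriminating (rare) terms over common ones."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) for t in df}

def expand_query(query, relevant_docs, idf, top_k=3, beta=0.5):
    """Add the top_k highest-weighted terms from judged-relevant
    documents to the query vector (simplified Rocchio feedback)."""
    q = Counter(query)
    centroid = Counter()
    for d in relevant_docs:
        for t in d:
            centroid[t] += idf.get(t, 0.0) / len(relevant_docs)
    # Expansion terms: strong in relevant docs, absent from the query.
    new_terms = [t for t, _ in centroid.most_common() if t not in q][:top_k]
    for t in new_terms:
        q[t] = beta * centroid[t]
    return q

# Toy collection: three short documents and a one-term query.
docs = [["stock", "market", "crash"],
        ["stock", "price", "rise"],
        ["weather", "rain"]]
idf = idf_weights(docs)
expanded = expand_query(["stock"], [docs[0], docs[1]], idf)
```

After feedback on the first two documents, the query picks up terms such as "market" and "price" that co-occur with "stock" in the relevant set, which is the mechanism by which query expansion can improve recall.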