NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Retrieval Experiments with a Large Collection using PIRCS
K.L. Kwok, L. Papadopoulos and Kathy Y.Y. Kwan
email: kklqc@cunyvm.bitnet
Computer Science Department, Queens College, CUNY
Flushing, NY 11367
ABSTRACT
Our approach to information retrieval and to the TREC experiments is based on techniques that have
previously been demonstrated to work for small to medium size collections: 1) use of document
components for retrieval and term weighting; 2) two-word phrases to achieve better precision and recall;
3) combination of retrieval methods; and 4) network implementation with learning capability to support
feedback and query expansion. Evaluation shows that we return the best results for Category B in both
ad hoc and routing retrievals, and our approach is comparable to the best methods used in Category A
experiments. It appears that techniques that work for small collections, such as combining soft-boolean
retrieval with a probabilistic model, user relevance feedback, and feedback with query expansion, also
work for this large collection.
1. Introduction
Over the past years, we have built an experimental system for automated information retrieval (IR) called
PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System. It provides storage and
retrieval capability based on collection statistics, using content terms as index terms for document
representation. Its design is based on a network, with flexibility in mind more than efficiency or
performance, so that future and unforeseen approaches to IR may be supported. As it turns out, when the
TREC experiments were announced, we found that our system fit nicely with the requirements. Certain
changes (mostly because of large file sizes) and a number of peripheral programs were needed, but the basic
design remains intact. Because available RAM and disk space are limited, we can only handle the Category
B Wall Street Journal files (WSJ, 500 MBytes of raw data); however, we foresee no problem running the
system on the Category A files (2 GBytes of raw data) if we have the appropriate hardware. In what follows we
shall describe our strategies for IR and TREC in Section 2, our system in Section 3 and the
Appendix, results and discussion in Section 4, and conclusions in Section 5.
2. Strategies for IR and TREC Experiments
Over the past twenty-five or so years, a number of techniques have been known to work in IR for small
to medium collections. These include: manually created thesauri and phrases to enhance recall and
precision; term weighting that can account for the importance of content and discrimination; user relevance
feedback and feedback with query expansion; combining multiple retrieval methods; and using multiple
document and query representations. Many other methods exist and have been experimented with (such
as clustering, natural language processing, various artificial intelligence techniques, etc.), but their
effectiveness in general is still in question. Our approach to IR and to the TREC experiments is based
on the following, which reflect the proven techniques above: 1) use of document components; 2)
two-word phrases; 3) combination of retrieval methods; and 4) network implementation with learning.
We shall discuss how and why each of these methodologies may contribute to the effectiveness of
our results.
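As an illustration of one of these methodologies, combining retrieval methods, the sketch below mixes a soft-boolean score (fraction of query terms matched) with a simple idf-based probabilistic score via a weighted sum. The function names, the idf weighting, and the mixing weight alpha are our own illustrative assumptions, not the actual PIRCS formulation.

```python
import math

def prob_score(query_terms, doc_terms, collection_size, doc_freq):
    """Probabilistic-style score: sum of idf weights of matched query terms."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            # inverse-document-frequency weight for term t
            score += math.log(collection_size / doc_freq.get(t, 1))
    return score

def soft_boolean_score(query_terms, doc_terms):
    """Soft-boolean score: fraction of query terms present in the document."""
    matched = sum(1 for t in query_terms if t in doc_terms)
    return matched / len(query_terms)

def combined_score(query_terms, doc_terms, collection_size, doc_freq, alpha=0.5):
    """Weighted linear combination of the two methods (alpha is an assumed mixing weight)."""
    return (alpha * soft_boolean_score(query_terms, doc_terms)
            + (1 - alpha) * prob_score(query_terms, doc_terms, collection_size, doc_freq))
```

A document matching more of the query's discriminating terms is thus ranked above one matching fewer, with the two evidence sources traded off by alpha.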