NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

TREC-2 Document Retrieval Experiments using PIRCS

K.L. Kwok & L. Grunfeld
Computer Science Dept., Queens College, CUNY
Flushing, NY 11367. email: kklqc@cunyvm.cuny.edu

Abstract

We performed the full experiments using our network implementation of the component probabilistic indexing and retrieval model. Documents were enhanced with a list of semi-automatically generated two-word phrases, and queries with automatic Boolean expressions. An item self-learning procedure was used to initialize network edge weights for retrieval. The initial results we submitted were above the median for ad hoc and below the median for routing. They were not up to expectation because of a bad choice of high-frequency cut-off for terms and the absence of query expansion for routing. Later experiments showed that our system does return very good results after correcting the earlier problems and adjusting some parameters. We also re-designed our system to handle virtually any number of large files in an incremental fashion, and to do retrieval and learning by initiating our network on demand, without first creating a full inverted file.

1. Introduction

In TREC-1 our system, called PIRCS (acronym for Probabilistic Indexing and Retrieval - Components - System), took part as a Category B participant, handling only the 0.5 GB Wall Street Journal collection because neither our software nor our hardware was sufficient for the full set of text files. In TREC-2 we participated in Category A. However, during a large portion of the time period we had to face fairly uncertain and sometimes difficult conditions. Plans to install a dedicated SPARC10 workstation with associated large memory and disk drives did not materialize until about three weeks before the deadline. Before this period, the SPARC2 workstation that we had been using was also shared with other users during the semester. Certain things that we wished to do were not done, and corners were cut to fit programs and data into the existing system. Much of our time was spent revamping our software to be more efficient in space and time utilization. Our focus remains as in TREC-1, namely, to improve representations of documents and queries, to test different learning methods, and to combine different retrieval methods to improve the final ranked retrieval output.

Section 2 summarizes our retrieval network; Section 3 discusses our improved system design; Section 4 is on item representation; Sections 5 and 6 are about our learning and retrieval procedures; Section 7 discusses the results we submitted, and Section 8 contains results of our later experiments. Section 9 follows with the conclusion.

2. A Retrieval Network in PIRCS

Our retrieval process is based on a three-layer Q-T-D (Query-Term-Document) network, details of which are given in [KwPK93, Kwok9x]. Here we give a review. From Fig. 1, DTQ query-focused retrieval means spreading an initial activation of 1 (one) from a document di towards query qa, gated by the intervening edges w_ki and w_ak. The resultant activation received at qa is W_dq = Σ_k w_ak*w_ki, and is the retrieval status value (RSV) of di with respect to qa. When activation initiated at qa spreads towards di, the activation received at di equals W_qd = Σ_k w_ik*w_ka, and is our QTD document-focused RSV for di.
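To make the two passes concrete, the following is a minimal sketch in Python (not the PIRCS implementation) of query-focused and document-focused spreading activation over a toy Q-T-D net, together with the additive combination described next. All data structures, names, and weight values are hypothetical illustrations; the actual edge-weight initialization and learning are those described in the surrounding text.

# Minimal sketch, not the authors' code: spreading activation over a toy
# Q-T-D net stored as nested dicts {source: {destination: weight}}.
def dtq_rsv(d_to_t, t_to_q, di, qa):
    # Query-focused RSV W_dq: activation 1 from document di is gated by the
    # edges w_ki (di acting on term tk) and w_ak (tk acting on query qa).
    return sum(w_ki * t_to_q.get(tk, {}).get(qa, 0.0)
               for tk, w_ki in d_to_t.get(di, {}).items())

def qtd_rsv(q_to_t, t_to_d, qa, di):
    # Document-focused RSV W_qd: activation from query qa is gated by the
    # edges w_ka (qa acting on tk) and w_ik (tk acting on document di).
    return sum(w_ka * t_to_d.get(tk, {}).get(di, 0.0)
               for tk, w_ka in q_to_t.get(qa, {}).items())

# Toy network: one query, two documents, three terms (illustrative weights).
d_to_t = {"d1": {"bank": 0.5, "loan": 0.5}, "d2": {"bank": 0.2, "river": 0.8}}
t_to_q = {"bank": {"q1": 1.2}, "loan": {"q1": 2.0}}
q_to_t = {"q1": {"bank": 0.4, "loan": 0.6}}
t_to_d = {"bank": {"d1": 1.1, "d2": 0.9}, "loan": {"d1": 1.8}}

for di in ("d1", "d2"):
    # Basic ranking function: combine the two RSVs additively, W = W_dq + W_qd.
    w = dtq_rsv(d_to_t, t_to_q, di, "q1") + qtd_rsv(q_to_t, t_to_d, "q1", di)
    print(di, round(w, 3))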
Combining the two additively, W = W_dq + W_qd, gives our basic retrieval ranking function. Edge weights w_ka, w_ki represent items (qa or di) acting on terms tk and reflect the usage of terms within items. Edge weights w_ak, w_ik (representing tk acting on qa or di) embed Bayesian inference and are initialized based on a component consideration of probabilistic indexing and retrieval. These weights can improve via a learning process when relevant documents are known for queries and vice versa: DTQ query-focused training when we know the set of documents and their components relevant to a query, and QTD document-focused training when we know the set of queries and their components relevant to a document. Query-focused training prepares queries to match new similar documents better, while document-focused training helps documents match new similar queries better. With this learning capability, the net behaves as, and can be viewed as, a superposition of two 2-layer direct-connect artificial neural networks, one in each direction. If a Boolean expression for query qa is known, it can also be represented as a tree and hung onto the net as shown in Fig. 2. Edge weights from