NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC2 Document Retrieval Experiments using PIRCS
K.L. Kwok & L. Grunfeld
Computer Science Dept., Queens College, CUNY
Flushing, NY 11367. email: kklqc@cunyvm.cuny.edu
Abstract
We performed the full set of experiments using our network
implementation of the component probabilistic indexing and
retrieval model. Documents were enhanced with a list of
semi-automatically generated two-word phrases, and queries
with automatic Boolean expressions. An item self-learning
procedure was used to initialize network edge weights for
retrieval. The initial results we submitted were above the
median for ad hoc retrieval and below the median for routing.
They were not up to expectation because of a poor choice of
high-frequency cut-off for terms and the absence of query
expansion for routing. Later experiments showed that our
system does return very good results after correcting the
earlier problems and adjusting some parameters. We also
re-designed our system to handle virtually any number of
large files in an incremental fashion, and to do retrieval and
learning by initiating our network on demand, without first
creating a full inverted file.
1. Introduction
In TREC-1 our system, called PIRCS (an acronym for
Probabilistic Indexing and Retrieval - Components - System),
took part as a Category B participant, handling only the 0.5
GB Wall Street Journal collection because neither our
software nor our hardware was sufficient for the full set of
text files. In TREC-2 we participated in Category A.
However, during a large portion of the time period we had
to face fairly uncertain and sometimes difficult conditions.
Plans to install a dedicated SPARC10 workstation and
associated large memory and disk drives did not materialize
until about three weeks before the deadline. Before this
period, the SPARC2 workstation that we had been using
was also shared with other users during the semester.
Certain things that we wished to do were not done, and
corners were cut to fit programs and data into the existing
system. Much of our time was spent revamping our
software to be more efficient in space and time utilization.
Our focus remains as in TREC-1, namely, to improve
representations of documents and queries, to test different
learning methods and to combine different retrieval methods
to improve final ranked retrieval output. Section 2
summarizes our retrieval network; Section 3 discusses our
improved system design; Section 4 is on item
representation; Sections 5 and 6 are about our learning and
retrieval procedures; Section 7 discusses the results we
submitted, and Section 8 contains results of our later
experiments. Section 9 follows with the conclusion.
2. A Retrieval Network in PIRCS
Our retrieval process is based on a three layer Q-T-D
(Query-Term-Document) network, details of which are
given in [KwPK93, Kwok9x]. Here we give a review.
From Fig. 1, DTQ query-focused retrieval means spreading
an initial activation of 1 (one) from a document di towards a
query qa, gated by the intervening edges wki and wak. The
resultant activation received at qa is Wdq = Σk wki*wak, and
is the retrieval status value (RSV) of di with respect to qa.
When activation initiated at qa spreads towards di, we obtain
the activation received at di, Wqd = Σk wka*wik, and this is
our QTD document-focused RSV for di. Combining the
two additively, W = Wdq + Wqd, gives our basic retrieval
ranking function. Edge weights wka, wki represent items (qa
or di) acting on terms tk and reflect usage of terms within
items. Edge weights wak, wik (representing tk acting on qa
or di) embed Bayesian inference and are initialized based on
a component consideration of probabilistic indexing and
retrieval. These weights can improve via a learning process
when relevant documents are known to queries and vice
versa: DTQ query-focused training when we know the set
of documents and their components relevant to a query, and
QTD document-focused training when we know the set of
queries and their components relevant to a document.
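Before discussing training further, the combined RSV computation above
can be illustrated with a short Python sketch. The dictionary names and
weight values below are assumed for illustration only and are not the
PIRCS implementation; activation spreads from the document through the
shared terms to the query (DTQ) and back (QTD), and the two sums are added.

    # Illustrative spreading-activation sketch for the Q-T-D net
    # (assumed data layout and weight values; not the PIRCS code).

    def combined_rsv(wki, wak, wka, wik):
        """Add the DTQ query-focused and QTD document-focused activations.
        wki[k]: document di acting on term tk    wak[k]: tk acting on query qa
        wka[k]: query qa acting on term tk       wik[k]: tk acting on document di
        """
        w_dq = sum(wki[k] * wak[k] for k in wki.keys() & wak.keys())  # di -> tk -> qa
        w_qd = sum(wka[k] * wik[k] for k in wka.keys() & wik.keys())  # qa -> tk -> di
        return w_dq + w_qd                                            # ranking score for (qa, di)

    # Hypothetical weights for terms shared by one query and one document:
    wki = {"bank": 0.4, "loan": 0.3, "rate": 0.3}
    wak = {"bank": 1.2, "loan": 0.8}
    wka = {"bank": 0.6, "loan": 0.4}
    wik = {"bank": 1.0, "loan": 0.9, "rate": 0.7}
    print(combined_rsv(wki, wak, wka, wik))   # 0.72 + 0.96 = 1.68

Only terms that the query and the document share contribute to either sum,
since activation must pass through a common term node tk.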
Query-focused training prepares queries to match new
similar documents better, while document-focused training
helps documents to match new similar queries better. With
learning capability, the net behaves and can be viewed as a
superposition of two 2-layer direct-connect artificial neural
networks, one in each direction. If a boolean expression for
query qa is known, it can also be represented as a tree and
hung onto the net as shown in Fig.2. Edge weights from