NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

TREC-2 Document Retrieval Experiments using PIRCS
K. Kwok and L. Grunfeld
non-content terms based on word sequence patterns. We
continue to use words from the title, description, narrative
and concept sections of the topics to form queries.
Experiments without the narrative section show a slight
decrease in performance for our system, in contrast to
[Crof93]. We also try to produce Boolean expressions
automatically as queries from the description and concepts
sections. This is done by using punctuation to delineate
phrases and ANDing the words within them. Phrases are
then ORed. This is a very crude way of getting Boolean
expressions for later soft-Boolean retrieval processing.
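A minimal sketch of this phrase-to-Boolean step (Python; the
punctuation set and function name are our illustrative choices,
not the actual PIRCS code):

    import re

    def boolean_query(text):
        """Crude Boolean expression from free text: punctuation
        delimits phrases, the words inside each phrase are ANDed,
        and the resulting clauses are ORed."""
        phrases = [p.strip() for p in re.split(r"[,;:.()?!]", text)
                   if p.strip()]
        clauses = ["(" + " AND ".join(p.lower().split()) + ")"
                   for p in phrases]
        return " OR ".join(clauses)

    print(boolean_query("machine translation, pending antitrust cases"))
    # (machine AND translation) OR (pending AND antitrust AND cases)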
5. Learning Procedures
In our network of Fig.1 the edge weights determine the
retrieval outcome. w_ak and w_ik capture the proportion of
term t_k in q_a or d_i. They are fixed and obtained from the
manifestation of terms in the respective items. w_ka (and
w_ki) has a log odds factor, and an inverse Collection Term
Frequency (ICTF_k) factor which is regarded as a constant
for a term t_k, as follows:

    w_{ka} = \ln[\, r_{ak} / (1 - r_{ak}) \,] + \mathrm{ICTF}_k    (1)
Here r_ak is the conditional probability that, given relevance
to q_a, term t_k occurs, and it needs to be learnt. It is
unknown unless one has a sample of documents relevant to
q_a. This is not applicable to initial ad hoc retrievals, where
a document collection is being processed against a new
query with no known results. Relevance feedback
information can remedy this later, but it is not available at
the beginning. One way is to ignore the log odds factor in
Eqn.1, as done in our TREC-1 experiments, resulting in
ICTF weighting.
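To make Eqn.1 concrete, here is a small numeric sketch. The
exact definition of the ICTF factor is an assumption on our
part: we take ICTF_k = ln[(1-s_k)/s_k], with s_k being t_k's
share of all term occurrences in the collection, following the
usual log-odds decomposition.

    import math

    def ictf(F_k, N_w):
        """Assumed ICTF_k = ln[(1 - s_k)/s_k], with s_k = F_k/N_w:
        t_k's share of all term occurrences in the collection."""
        s_k = F_k / N_w
        return math.log((1.0 - s_k) / s_k)

    def w_ka(r_ak, F_k, N_w):
        """Eqn.1: log odds of t_k given relevance to q_a, plus
        the collection-constant ICTF_k factor."""
        return math.log(r_ak / (1.0 - r_ak)) + ictf(F_k, N_w)

    # A rare term (100 of 10 million collection occurrences), r_ak = 0.2:
    print(round(w_ka(0.2, 100, 10_000_000), 2))   # 10.13
    # Dropping the log odds factor leaves pure ICTF weighting:
    print(round(ictf(100, 10_000_000), 2))        # 11.51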
A better way, which we use for TREC-2, is to include item
self-learning to determine r_ak and initiate the term weights.
This is shown in Fig.3 and is based on the following argument.
Consider a document d_i containing
certain concepts and topics. Imagine its author wishing to
search the textbase for the same topics as this document
s/he has written; what query would be most suitable?
Naturally the author's own words in the document can serve
as the `query', and there is also known to be one relevant
item to this `query' in the collection, viz. the document
itself. In other words, every item is assumed to be self-
relevant. One relevant item is, however, not sufficient for
estimating r_ak. Our method is to consider each document as
constituted of many independent conceptual components,
each being described by a list of terms. We therefore work
in a component universe rather than in the document
collection. Components can be units such as sentences or
phrases, but we have used single terms for simplicity.
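The component universe can be pictured with a small sketch,
assuming single terms as components as just described; the
function and data are illustrative only:

    def split_component_universe(docs, i):
        """Document d_i as its own 'query': every term occurrence
        is a component; components drawn from d_i are self-relevant,
        the rest of the universe is non-relevant."""
        relevant, nonrelevant = [], []
        for j, doc in enumerate(docs):
            for term in doc:             # one component per occurrence
                (relevant if j == i else nonrelevant).append(term)
        return relevant, nonrelevant

    docs = [["bank", "loan", "loan"], ["river", "bank"], ["loan", "rate"]]
    rel, nonrel = split_component_universe(docs, 0)
    print(rel)          # ['bank', 'loan', 'loan']: d_0's self-relevant set
    print(len(nonrel))  # 4 components elsewhere in the universe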
Right from the start then, even without any relevance
feedback, we can divide the component universe into two
parts: one set relevant, and the other non-relevant, to each
such `query'. Standard probabilistic retrieval theory now
enables us to define this `query' optimally -- meaning that
the defined `query' will rank its set of relevant components
optimally with respect to the other components when used
for retrieval. The definition of this `query' (i.e. its terms
and weights) becomes the initial indexing representation of
the document, and it is used in our ad hoc QTD retrieval.
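A sketch of how such an optimal self-learnt `query' might be
computed. Both estimators here are our assumptions rather than
the paper's exact formulas: r is taken as a term's share of
d_i's own self-relevant components, and the collection factor
as its share of the whole component universe:

    import math
    from collections import Counter

    def self_learn_weights(docs, i, eps=1e-9):
        """Sketch of d_i's initial 'query' (index) term weights via
        Eqn.1 over the component universe, under the estimator
        assumptions stated above."""
        d = Counter(docs[i])
        L = sum(d.values())                      # components in d_i
        universe = Counter(t for doc in docs for t in doc)
        N = sum(universe.values())               # components overall
        weights = {}
        for k, f in d.items():
            r = min(max(f / L, eps), 1 - eps)    # keep log odds finite
            s = universe[k] / N
            weights[k] = (math.log(r / (1 - r))
                          + math.log((1 - s) / s))
        return weights

    docs = [["bank", "loan", "loan"], ["river", "bank"], ["loan", "rate"]]
    print(self_learn_weights(docs, 0))  # repeated 'loan' outweighs 'bank'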
This construction is the principle of document self-recovery
introduced in [Kwok90] and implemented as a self-learning
process in a network [Kwok89,90], shown in Fig.4. One can
argue that this relevant set of components from one document
is too small. But that is all the information one has at this
stage. Previous experiments [Kwok90] show that this kind of
weighting can outperform ICTF weighting by a few percent.
Moreover, our network self-learning parameters can be
adjusted to provide a smooth transition from ICTF to full
self-learn weights, or any value in between. We invoke the
[Fig.3: Item Self-Learn Using Its Own Self-Relevant Components.
Document d_i serves as its own `query'; both the document space
and the component space are divided into a set relevant to d_i
and an irrelevant remainder. Documents are not monolithic, but
constituted of components.]