NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

TREC-2 Document Retrieval Experiments using PIRCS
K. Kwok and L. Grunfeld
non-content terms based on word sequence patterns. We
continue to use words from the title, description, narrative
and concept sections of the topics to form queries.
Experiments without the narrative section show a slight
decrease in performance for our system, in contrast to
[Crof93]. We also try to produce Boolean expressions
automatically as queries from the description and concepts
sections. This is done by using punctuation to delineate
phrases and ANDing the words within them. Phrases are
then ORed. This is a very crude way of getting Boolean
expressions for later soft-Boolean retrieval processing.
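A minimal sketch of this phrase-to-Boolean step (Python; the
punctuation set and function name are our illustrative choices,
not the actual PIRCS code):

    import re

    def boolean_query(text):
        """Crude Boolean expression from free text: punctuation
        delimits phrases, the words inside each phrase are ANDed,
        and the resulting clauses are ORed."""
        phrases = [p.strip() for p in re.split(r"[,;:.()?!]", text)
                   if p.strip()]
        clauses = ["(" + " AND ".join(p.lower().split()) + ")"
                   for p in phrases]
        return " OR ".join(clauses)

    print(boolean_query("machine translation, pending antitrust cases"))
    # (machine AND translation) OR (pending AND antitrust AND cases)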
5. Learning Procedures
In our network of Fig.1 the edge weights determine the
retrieval outcome. w_ak and w_ik capture the proportion of
term t_k in q_a or d_i. They are fixed and obtained from the
manifestation of terms in the respective items. w_ka (and
w_ki) has a log odds factor, and an inverse Collection Term
Frequency (ICTF_k) factor which is regarded as a constant
for a term t_k, as follows:

    w_{ka} = \ln[\, r_{ak} / (1 - r_{ak}) \,] + \mathrm{ICTF}_k    (1)
Here r_ak is the conditional probability that, given relevance
to q_a, term t_k occurs, and it needs to be learnt. It is
unknown unless one has a sample of documents relevant to
q_a. This is not applicable to initial ad hoc retrievals, where
a document collection is being processed against a new
query with no known results. Relevance feedback
information can remedy this later, but it is not available at
the beginning. One way is to ignore the log odds factor in
Eqn.1, as done in our TREC-1 experiments, resulting in
ICTF weighting.
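To make Eqn.1 concrete, here is a small numeric sketch. The
exact definition of the ICTF factor is an assumption on our
part: we take ICTF_k = ln[(1-s_k)/s_k], with s_k being t_k's
share of all term occurrences in the collection, following the
usual log-odds decomposition.

    import math

    def ictf(F_k, N_w):
        """Assumed ICTF_k = ln[(1 - s_k)/s_k], with s_k = F_k/N_w:
        t_k's share of all term occurrences in the collection."""
        s_k = F_k / N_w
        return math.log((1.0 - s_k) / s_k)

    def w_ka(r_ak, F_k, N_w):
        """Eqn.1: log odds of t_k given relevance to q_a, plus
        the collection-constant ICTF_k factor."""
        return math.log(r_ak / (1.0 - r_ak)) + ictf(F_k, N_w)

    # A rare term (100 of 10 million collection occurrences), r_ak = 0.2:
    print(round(w_ka(0.2, 100, 10_000_000), 2))   # 10.13
    # Dropping the log odds factor leaves pure ICTF weighting:
    print(round(ictf(100, 10_000_000), 2))        # 11.51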
A better way, which we use for TREC-2, is to include item
self-learning to determine r_ak and initiate the term weights.
This is shown in Fig.3 and is based on the following argument.
Consider a document d_i containing
certain concepts and topics. Imagine its author wishing to
search the textbase for the same topics as this document
s/he has written; what query would be most suitable?
Naturally the author's own words in the document can serve
as the `query', and there is also known to be one relevant
item to this `query' in the collection, viz. the document
itself. In other words, every item is assumed to be self-
relevant. One relevant item is, however, not sufficient for
estimating r_ak. Our method is to consider each document as
constituted of many independent conceptual components,
each being described by a list of terms. We therefore work
in a component universe rather than in the document
collection. Components can be units such as sentences or
phrases, but we have used single terms for simplicity.
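The component universe can be pictured with a small sketch,
assuming single terms as components as just described; the
function and data are illustrative only:

    def split_component_universe(docs, i):
        """Document d_i as its own 'query': every term occurrence
        is a component; components drawn from d_i are self-relevant,
        the rest of the universe is non-relevant."""
        relevant, nonrelevant = [], []
        for j, doc in enumerate(docs):
            for term in doc:             # one component per occurrence
                (relevant if j == i else nonrelevant).append(term)
        return relevant, nonrelevant

    docs = [["bank", "loan", "loan"], ["river", "bank"], ["loan", "rate"]]
    rel, nonrel = split_component_universe(docs, 0)
    print(rel)          # ['bank', 'loan', 'loan']: d_0's self-relevant set
    print(len(nonrel))  # 4 components elsewhere in the universe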
Right from the start then, even without any relevance
feedback, we can divide the component universe into two
parts: one set relevant, and the other non-relevant, to each
such `query'. Standard probabilistic retrieval theory now
enables us to define this `query' optimally -- meaning that
the defined `query' will rank its set of relevant components
optimally with respect to the other components when used
for retrieval. The definition of this `query' (i.e. its terms
and weights) becomes the initial indexing representation of
the document, and it is used in our ad hoc QTD retrieval.
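A sketch of how such an optimal self-learnt `query' might be
computed. Both estimators here are our assumptions rather than
the paper's exact formulas: r is taken as a term's share of
d_i's own self-relevant components, and the collection factor
as its share of the whole component universe:

    import math
    from collections import Counter

    def self_learn_weights(docs, i, eps=1e-9):
        """Sketch of d_i's initial 'query' (index) term weights via
        Eqn.1 over the component universe, under the estimator
        assumptions stated above."""
        d = Counter(docs[i])
        L = sum(d.values())                      # components in d_i
        universe = Counter(t for doc in docs for t in doc)
        N = sum(universe.values())               # components overall
        weights = {}
        for k, f in d.items():
            r = min(max(f / L, eps), 1 - eps)    # keep log odds finite
            s = universe[k] / N
            weights[k] = (math.log(r / (1 - r))
                          + math.log((1 - s) / s))
        return weights

    docs = [["bank", "loan", "loan"], ["river", "bank"], ["loan", "rate"]]
    print(self_learn_weights(docs, 0))  # repeated 'loan' outweighs 'bank'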
This construction is the principle of document self-recovery
introduced in [Kwok90] and implemented as a self-learning
process in a network [Kwok89,90], shown in Fig.4. One can
argue that this relevant set of components from one document
is too small. But that is all the information one has at this
stage. Previous experiments [Kwok90] show that this kind of
weighting can outperform ICTF weighting by a few percent.
Moreover, our network self-learning parameters can be
adjusted to provide a smooth transition from ICTF to full
self-learn weights, or any value in between. We invoke the
[Fig.3: Item Self-Learn Using Its Own Self-Relevant Components.
Document d_i serves as its own `query'; both the document space
and the component space are divided into a set relevant to d_i
and an irrelevant remainder. Documents are not monolithic, but
constituted of components.]