NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Grey, A. Chen
National Institute of Standards and Technology
Donna K. Harman

This split-level approach allows for a natural separation of the available retrieval clues into two kinds. Statistical inferences based on properties of particular terms may be drawn first, while other kinds of evidence not confined to particular terms (e.g. document length, citedness, etc.) are saved for the second stage of statistical inference. An important virtue of this split-level approach is that the second-stage regression tends to correct for biases introduced by the statistical simplifying assumptions used to consolidate the results of the first stage.

A `Bare-Bones' SLR Methodology

Because the sole focus of interest in the experiment was the logic of the SLR approach, it seemed appropriate to keep all other design complications to a minimum. Except for the capacity to perform the two-level probabilistic computations requisite for SLR, therefore, the experimental system was kept as simple and automatic as possible. Thus no phrase discovery, part-of-speech tagging, disambiguation, or other linguistically sophisticated operations were incorporated, nor was a thesaurus included for the conflation of synonyms or other purposes, nor was the descriptor vocabulary structured in any way. There was no clustering, no knowledge base, no set of implicative rules, no network, nor anything else `AI-like.' All indexing was performed extractively without benefit of human intervention. No use was made of the manually assigned descriptors in the document collections that had them.
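The split-level idea described at the start of this section can be illustrated with a small sketch. This is not the paper's set of design equations; it is a minimal toy version under stated assumptions. The term-level clue (a log-scaled within-document frequency), the use of plain summation to consolidate first-stage results, the scaled document-length clue, the training data, and all function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear(w, x):
    # Linear predictor; under a logistic model this is the estimated logodds.
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def fit_logistic(rows, labels, steps=2000, lr=0.1):
    # Plain batch gradient ascent on the log-likelihood.
    # Returns [intercept, w_1, ..., w_k].
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - sigmoid(linear(w, x))
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Stage 1: one row per (query, document, matching term) triple.
# Hypothetical clue: log-scaled within-document term frequency.
term_rows = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5]]
term_labels = [0, 0, 1, 0, 1, 1]
stage1 = fit_logistic(term_rows, term_labels)

def pair_features(term_clues, doc_length):
    # Consolidate the term-level logodds by simple summation (a simplifying
    # assumption), then append a clue not confined to any particular term.
    summed = sum(linear(stage1, clue) for clue in term_clues)
    return [summed, doc_length]

# Stage 2: one row per query-document pair (toy data, made up for the sketch):
# (term clues for the matching terms, scaled document length, relevance label).
pairs = [
    ([[2.0], [2.5]], 0.4, 1),
    ([[0.5]], 1.2, 0),
    ([[1.5], [1.0], [2.0]], 0.8, 1),
    ([[0.0], [0.5]], 0.6, 0),
    ([[2.5]], 1.5, 0),
    ([[1.0]], 0.3, 1),
]
pair_rows = [pair_features(clues, length) for clues, length, _ in pairs]
pair_labels = [rel for _, _, rel in pairs]
stage2 = fit_logistic(pair_rows, pair_labels)

def estimate_relevance(term_clues, doc_length):
    # Estimated probability of relevance for one query-document pair.
    return sigmoid(linear(stage2, pair_features(term_clues, doc_length)))
```

Calling `estimate_relevance([[2.0], [1.5]], 0.5)` returns an estimated probability of relevance for a pair with two matching terms. Because the second regression is fitted on the summed first-stage logodds rather than taking the sum at face value, it has the opportunity to rescale that sum, which is the sense in which a second stage can compensate for the simplification made when consolidating the first.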
The experimental retrieval system was implemented by modifying the SMART system (Version 10), a suite of IR programs generously provided to the IR research community by researchers at Cornell University. Since the new model to be implemented was probabilistic, all features of SMART motivated by the vector space retrieval model were left unused or replaced by corresponding probabilistic operations. The SMART stop list was used as-is except for the addition of a few common words from the query vocabulary thought unlikely to be content-indicative (e.g. `document', `contains'). The SMART system's stemmer was used without modification to strip suffixes and endings off of all query and document words that survived the stop list. The search algorithm used a slightly extended form of SMART's inverted file. The retrieval speed and general programming complexity of the SMART system were left substantially unaffected by this conversion to SLR-based retrieval, which meant that the objective of run-time efficiency and complexity roughly comparable to that of a vector processing system had been achieved.

The Design Equations

At the heart of the design are four statistical equations. Taken together they are capable of supplying an estimate, for any query-document pair of interest, of the probability that the document in question is relevant to the query in question. We shall take them up in their natural order of application.

Some preliminary vocabulary: A `logodds' may be regarded simply as a probability re-expressed on a special scale. The odds $O(E)$ of an event $E$ is by definition $\frac{P(E)}{P(\bar{E})}$. The conditional odds $O(E_1 \mid E_2)$ is $\frac{P(E_1 \mid E_2)}{P(\bar{E}_1 \mid E_2)}$. The `logodds' of E, sometimes also