SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression chapter W. Cooper A. Chen F. Gey National Institute of Standards and Technology D. K. Harman The Fundamental Equation Equation (1): As the starting point of our probabilistic retrieval model we adopt (with reservations to be explained presently) a statistical simplifying assumption called the `Assumption of Linked Dependence' (Cooper 1991). Intuitively, it states that in the universe of all query- document pairs, the degree of statistical dependency that exists between properties of pairs in the subset of all rele- vance-related pairs is linked in a certain way with the degree of dependency that exists between the same pair- properties in the subset of all nonrelevance-related pairs. Under at least one crude measure of degree of dependence (specifically, the ratio of the joint probability to the prod- uct of individual probabilities), the linkage in question is one of identity. That is, the claim made by the assumption is that the degree of the dependency is the same in both sets. For N pair-properties ..... . , AN, the mathematical statement of the assumption is as follows. ASSUM[OCRerr]flON OF LINKED DEPENDENCE: P(A1....ANIR) _ P(A1IR) P(ANIR) -xx P(A1,..., ANIR) - P(A1IR) P(ANIA) If one thinks of a query and document drawn at random, R is the event that the document is relevant to the query, while each clue A[OCRerr] is some respect in which the document is similar or dissimilar to, or somehow related to, the query. For most clue-types likely to be of interest this sim- plifying assumption is at best only approximately true. Though not as implausible as the strong conditional inde- pendence assumptions traditionally discussed in proba- bilistic IR, still it should be regarded with skepticism. In practice, when the number N of clues to be combined becomes large the assumption can become highly unreal- istic, with a distorting tendency that often causes the prob- ability of relevance estimates produced by the resulting retrieval model to be grossly overstated for documents near the top of the output ranking. But it will be expedi- ent here to adopt the assumption anyway, on the under- standing that later on we shall have to find a way to curb its probability-inflating tendencies. From the Assumption of Linked Dependence it is possible to derive the basic equation underlying the proba- bilistic model: 58 log 0(RIA1....AN) = N log 0(R) + [OCRerr] [log 0(R I A[OCRerr]) - log 0(R)] where for any events E and E' the odds 0(E I E') is by P(E I E') definition P(E I E') Using this equation, the evidence of N separate clues can be combined as shown in the right side to yield the logodds of relevance based on all clues, shown on the left. Query-document pairs can be ranked by this logodds estimate, and a ranking of documents for a particular query by logodds is of course a probability ranking in the IR sense. For further discussion of the linked dependence assumption and a formal derivation of Eq. (1) from it, see Cooper (1991) and Cooper, Dabney & Gey (1992). An empirical investigation of independence and dependence assumptions in IR is reported by Gey (1993). Application to Term Properties We shall consider as a ftrst application of the model- ing equation the problem of how to exploit the properties of match terms. By a `match' term we mean a stem (or word, phrase, etc.) that occurs in the query and also, in identical or related form, in the document to which the query is being compared. The retrieval properties of a match term could include its frequency characteristics in the query, in the document, or in the collection as a whole, its grammatical characteristics, the type or degree of `match' involved, etc. If there are M match terms that relate a certain query to a certain document, Eq. (1) becomes applicable with N set to M. Each of the proper- ties [OCRerr]..... ,AM then represents a composite clue or set of properties concerning one of the query-document pair's match terms. Suppose a `learning set' of human judgements of relevance or nonrelevance is available for a sample of rep- resentative query-document pairs. (However, for the time being we assume no such learning data is available for the very queries on which the retrieval is to be performed.) Logistic regression can be applied to the data to develop an equation capable of using the match term clues to esti- mate, for any query-document pair, values for each of the expressions in the right side of Eq. (1). Eq. (1) then yields a probability estimate for the pair.