SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
The Fundamental Equation Equation (1):
As the starting point of our probabilistic retrieval
model we adopt (with reservations to be explained
presently) a statistical simplifying assumption called the
`Assumption of Linked Dependence' (Cooper 1991).
Intuitively, it states that in the universe of all query-
document pairs, the degree of statistical dependency that
exists between properties of pairs in the subset of all rele-
vance-related pairs is linked in a certain way with the
degree of dependency that exists between the same pair-
properties in the subset of all nonrelevance-related pairs.
Under at least one crude measure of degree of dependence
(specifically, the ratio of the joint probability to the prod-
uct of individual probabilities), the linkage in question is
one of identity. That is, the claim made by the assumption
is that the degree of the dependency is the same in both
sets.
For N pair-properties ..... . , AN, the mathematical
statement of the assumption is as follows.
ASSUM[OCRerr]flON OF LINKED DEPENDENCE:
P(A1....ANIR) _ P(A1IR) P(ANIR)
-xx
P(A1,..., ANIR) - P(A1IR) P(ANIA)
If one thinks of a query and document drawn at random, R
is the event that the document is relevant to the query,
while each clue A[OCRerr] is some respect in which the document
is similar or dissimilar to, or somehow related to, the
query.
For most clue-types likely to be of interest this sim-
plifying assumption is at best only approximately true.
Though not as implausible as the strong conditional inde-
pendence assumptions traditionally discussed in proba-
bilistic IR, still it should be regarded with skepticism. In
practice, when the number N of clues to be combined
becomes large the assumption can become highly unreal-
istic, with a distorting tendency that often causes the prob-
ability of relevance estimates produced by the resulting
retrieval model to be grossly overstated for documents
near the top of the output ranking. But it will be expedi-
ent here to adopt the assumption anyway, on the under-
standing that later on we shall have to find a way to curb
its probability-inflating tendencies.
From the Assumption of Linked Dependence it is
possible to derive the basic equation underlying the proba-
bilistic model:
58
log 0(RIA1....AN) =
N
log 0(R) + [OCRerr] [log 0(R I A[OCRerr]) - log 0(R)]
where for any events E and E' the odds 0(E I E') is by
P(E I E')
definition P(E I E') Using this equation, the evidence of
N separate clues can be combined as shown in the right
side to yield the logodds of relevance based on all clues,
shown on the left. Query-document pairs can be ranked
by this logodds estimate, and a ranking of documents for a
particular query by logodds is of course a probability
ranking in the IR sense. For further discussion of the
linked dependence assumption and a formal derivation of
Eq. (1) from it, see Cooper (1991) and Cooper, Dabney &
Gey (1992). An empirical investigation of independence
and dependence assumptions in IR is reported by Gey
(1993).
Application to Term Properties
We shall consider as a ftrst application of the model-
ing equation the problem of how to exploit the properties
of match terms. By a `match' term we mean a stem (or
word, phrase, etc.) that occurs in the query and also, in
identical or related form, in the document to which the
query is being compared. The retrieval properties of a
match term could include its frequency characteristics in
the query, in the document, or in the collection as a whole,
its grammatical characteristics, the type or degree of
`match' involved, etc. If there are M match terms that
relate a certain query to a certain document, Eq. (1)
becomes applicable with N set to M. Each of the proper-
ties [OCRerr]..... ,AM then represents a composite clue or set of
properties concerning one of the query-document pair's
match terms.
Suppose a `learning set' of human judgements of
relevance or nonrelevance is available for a sample of rep-
resentative query-document pairs. (However, for the time
being we assume no such learning data is available for the
very queries on which the retrieval is to be performed.)
Logistic regression can be applied to the data to develop
an equation capable of using the match term clues to esti-
mate, for any query-document pair, values for each of the
expressions in the right side of Eq. (1). Eq. (1) then yields
a probability estimate for the pair.