SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
chapter
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
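TABLE II: Reinterpretation of the Components of Eq. (1) for Routing Retrieval

$$
\begin{array}{ll}
\log O(R) &
  \log \dfrac{n(R)}{n(\bar{R})} \\[2ex]
\log O(R \mid A_1) - \log O(R) &
  \log \dfrac{n(T_1, R) + p\,n(R)}{n(T_1, \bar{R}) + p\,n(\bar{R})}
  - \log \dfrac{n(R)}{n(\bar{R})} \\[1ex]
\quad\vdots & \quad\vdots \\[1ex]
\log O(R \mid A_M) - \log O(R) &
  \log \dfrac{n(T_M, R) + p\,n(R)}{n(T_M, \bar{R}) + p\,n(\bar{R})}
  - \log \dfrac{n(R)}{n(\bar{R})} \\[2ex]
\log O(R \mid A_{M+1}) - \log O(R) &
  \log \dfrac{n(\bar{T}_{M+1}, R) + p\,n(R)}{n(\bar{T}_{M+1}, \bar{R}) + p\,n(\bar{R})}
  - \log \dfrac{n(R)}{n(\bar{R})} \\[1ex]
\quad\vdots & \quad\vdots \\[1ex]
\log O(R \mid A_Q) - \log O(R) &
  \log \dfrac{n(\bar{T}_Q, R) + p\,n(R)}{n(\bar{T}_Q, \bar{R}) + p\,n(\bar{R})}
  - \log \dfrac{n(R)}{n(\bar{R})} \\[2ex]
\hline
\log O(R \mid A_1, \ldots, A_Q) &
  \displaystyle\sum_{m=1}^{M} \log \dfrac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})}
  + \sum_{m=M+1}^{Q} \log \dfrac{n(\bar{T}_m, R) + p\,n(R)}{n(\bar{T}_m, \bar{R}) + p\,n(\bar{R})}
  - (Q-1) \log \dfrac{n(R)}{n(\bar{R})}
\end{array}
$$
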
nonspecific set may well be available at the same time. If
so, the theory developed in the foregoing section can be
applied in conjunction with the earlier theory to capture
the benefits of both kinds of learning sets. The retrieval
equation will then contain variables not only of the kind
occurring in Eq. (4) but also of the Eq. (2) kind.
It is convenient to formulate this equation in such a
way that it contains as one of its terms the entire ranking
expression developed earlier for the nonspecific learning
data. For the TIPSTER data the combined equation takes
the form:
Equation (5):

$$
\log O(R \mid A_1, \ldots, A_M, A'_1, \ldots) = 0.688\,\{\text{right side of Eq. (3)}\} + 0.344\,[\,\{\text{variable 1}\} + \{\text{variable 2}\} - \{\text{variable 3}\}\,] + 0.0623
$$

where the quantity weighted by 0.688 is the entire right side of Eq. (3) and
variables 1, 2, and 3 are as defined in Eq. (4). This form for the equation is
computationally convenient if Eq. (3) is to be used as a
preliminary screening rule to eliminate unpromising
documents, with Eq. (5) in its entirety applied only to rank
those that survive the screening.
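
The two-stage use of Eq. (3) and Eq. (5) described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the `keep` cutoff, and the choice to screen by keeping the top-scoring documents (rather than by some threshold) are all assumptions, and the callables `eq3_score` and `eq4_vars` are hypothetical stand-ins for whatever computes the right side of Eq. (3) and the three variables of Eq. (4). Only the coefficients 0.688, 0.344, and 0.0623 come from the text.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def eq5_log_odds(eq3_value: float, v1: float, v2: float, v3: float) -> float:
    """Combine the Eq. (3) score with the three Eq. (4) variables,
    using the coefficients reported for Eq. (5)."""
    return 0.688 * eq3_value + 0.344 * (v1 + v2 - v3) + 0.0623


def screen_then_rank(
    docs: Iterable[str],
    eq3_score: Callable[[str], float],            # hypothetical: right side of Eq. (3)
    eq4_vars: Callable[[str], Sequence[float]],   # hypothetical: the three Eq. (4) variables
    keep: int = 1000,                             # assumed cutoff, not given in the text
) -> List[Tuple[float, str]]:
    # Stage 1: cheap screening with Eq. (3) alone; keep the best `keep` documents.
    screened = sorted(
        ((eq3_score(d), d) for d in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:keep]

    # Stage 2: apply Eq. (5) in full only to the survivors, reusing the
    # Eq. (3) value already computed during screening.
    ranked = [(eq5_log_odds(s, *eq4_vars(d)), d) for s, d in screened]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Because the Eq. (3) value appears as a single term of Eq. (5), the screening score is reused in the second stage rather than recomputed, which is the computational convenience the text refers to.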
Eq. (5) was used to produce the TREC-2 routing run
`Brkly5'. Its coefficients were determined by a logistic
regression constrained in such a way as to make the
query-specific variables contribute about twice as heavily
as the nonspecific, when contribution is measured by
standardized coefficient size. This emphasis was largely
arbitrary; finding the optimal balance between the
query-specific and the general contributions remains a topic for
future research. A value of 20/n(R) was used for p. This
choice too was arbitrary, and it would be interesting to try
to optimize it experimentally for some typical collection
(trying out, perhaps, numbers larger than 20, divided by
the total number of documents in the query's learning set).
No restraining function f(Q) was used in the final form of
Eq. (5) because none that were tried out produced any
discernible improvement in fit or retrieval effectiveness in
this context.
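
To make the role of p concrete, the following minimal sketch evaluates the summed routing log odds in the spirit of Table II. It rests on several assumptions that should be checked against the table: each component is read as the log of a smoothed ratio of relevant to nonrelevant counts, with p n(R) added to the numerator and p n(R-bar) to the denominator; the prior log odds log O(R) is taken as log[n(R)/n(R-bar)]; and p defaults to 20/n(R) as stated above. The function and parameter names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Counts = Tuple[float, float]  # (count among relevant docs, count among nonrelevant docs)


def routing_log_odds(
    present: Sequence[Counts],   # per stem present in the document: docs containing the stem
    absent: Sequence[Counts],    # per query stem absent from it: docs NOT containing the stem
    n_rel: float,                # n(R): relevant documents in the query's learning set
    n_nonrel: float,             # nonrelevant documents in the query's learning set
    p: Optional[float] = None,   # smoothing parameter; assumed default 20 / n(R)
) -> float:
    if p is None:
        p = 20.0 / n_rel

    prior = math.log(n_rel / n_nonrel)  # log O(R)

    # Sum of the smoothed log ratios, one per clue.
    total = 0.0
    for rel_count, nonrel_count in list(present) + list(absent):
        total += math.log((rel_count + p * n_rel) / (nonrel_count + p * n_nonrel))

    # Adding log O(R) once and subtracting it once per clue leaves
    # a net correction of -(Q - 1) * log O(R), where Q is the number of clues.
    q = len(present) + len(absent)
    return total - (q - 1) * prior
```

With p = 20/n(R), the smoothing adds the equivalent of 20 pseudo-observations to the relevant count, so a stem with no learning data contributes exactly the prior odds and leaves the estimate unchanged.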