NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression

W. Cooper, A. Chen, F. Gey

National Institute of Standards and Technology
D. K. Harman

TABLE II: Reinterpretation of the Components of Eq. (1) for Routing Retrieval

$$
\begin{array}{ll}
\text{Component of Eq. (1)} & \text{Routing reinterpretation} \\
\hline
\log O(R) & \log \dfrac{n(R)}{n(\bar{R})} \\[2ex]
\log O(R \mid A_m) - \log O(R), \quad m = 1, \ldots, M & \log \dfrac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})} - \log \dfrac{n(R)}{n(\bar{R})} \\[2ex]
\multicolumn{2}{l}{\text{(rows for } A_{M+1}, \ldots, A_Q \text{ and the combined expression for } \log O(R \mid A_1, \ldots, A_Q) \text{: not legible in the source)}}
\end{array}
$$

nonspecific set may well be available at the same time. If so, the theory developed in the foregoing section can be applied in conjunction with the earlier theory to capture the benefits of both kinds of learning sets. The retrieval equation will then contain variables not only of the kind occurring in Eq. (4) but also of the Eq. (2) kind. It is convenient to formulate this equation in such a way that it contains as one of its terms the entire ranking expression developed earlier for the nonspecific learning data. For the TIPSTER data the combined equation takes the form:

Equation (5):

$$
\log O(R \mid A_1, \ldots, A_M, A'_1, \ldots) = 0.688\,\Phi + 0.344\,[\Psi_1 + \Psi_2 - \Psi_3] + 0.0623
$$

where $\Phi$ is the entire right side of Eq. (3) and $\Psi_1$, $\Psi_2$, $\Psi_3$ are as defined in Eq. (4). This form for the equation is computationally convenient if Eq. (3) is to be used as a preliminary screening rule to eliminate unpromising documents, with Eq. (5) in its entirety applied only to rank those that survive the screening. Eq. (5) was used to produce the TREC-2 routing run `Brkly5'.
Its coefficients were determined by a logistic regression constrained in such a way as to make the query-specific variables contribute about twice as heavily as the nonspecific, when contribution is measured by standardized coefficient size. This emphasis was largely arbitrary; finding the optimal balance between the query-specific and the general contributions remains a topic for future research. A value of 20/n(R) was used for p. This choice too was arbitrary, and it would be interesting to try to optimize it experimentally for some typical collection (trying out, perhaps, numbers larger than 20, divided by the total number of documents in the query's learning set). No restraining function f(Q) was used in the final form of Eq. (5) because none that were tried out produced any discernible improvement in fit or retrieval effectiveness in this context.
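
The two-stage use of Eq. (5) described above, and the effect of the smoothing parameter p, can be illustrated with a minimal sketch. It is not the code used for the `Brkly5' run. Several things in it are assumptions made for illustration: the function and argument names are hypothetical; the per-term weight follows one reading of Table II (a count ratio smoothed by p, minus the prior log odds); the Eq. (3) score and the three Eq. (4) variables are taken as precomputed numbers; and the screening step is modeled as keeping a fixed number of top-scoring documents, which the text does not specify.

```python
import math


def smoothed_term_weight(n_t_rel, n_t_nonrel, n_rel, n_nonrel, p):
    """Estimate log O(R|A) - log O(R) for one query term from learning-set counts.

    Assumed reading of Table II: the term's relevant/nonrelevant counts are
    smoothed by p * n(R) and p * n(not R) before the ratio is taken.
    """
    posterior = math.log((n_t_rel + p * n_rel) / (n_t_nonrel + p * n_nonrel))
    prior = math.log(n_rel / n_nonrel)
    return posterior - prior


def eq5_log_odds(eq3_score, psi1, psi2, psi3):
    """Combine the nonspecific score (right side of Eq. (3)) with the three
    query-specific variables of Eq. (4), using the coefficients of Eq. (5)."""
    return 0.688 * eq3_score + 0.344 * (psi1 + psi2 - psi3) + 0.0623


def rank_with_screening(docs, keep):
    """Rank documents in two stages.

    docs: list of (doc_id, eq3_score, psi1, psi2, psi3) tuples.
    keep: number of documents that survive the Eq. (3) screening
          (a hypothetical cutoff rule, chosen here for simplicity).
    """
    # Stage 1: screen on the cheaper Eq. (3) score alone.
    survivors = sorted(docs, key=lambda d: d[1], reverse=True)[:keep]
    # Stage 2: apply Eq. (5) in full only to the survivors.
    scored = [
        (doc_id, eq5_log_odds(e3, p1, p2, p3))
        for doc_id, e3, p1, p2, p3 in survivors
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    n_rel, n_nonrel = 40, 160          # made-up learning-set sizes
    p = 20 / n_rel                     # the value of p reported in the text
    print(smoothed_term_weight(30, 10, n_rel, n_nonrel, p))
    docs = [
        ("d1", 2.0, 0.0, 0.0, 0.0),
        ("d2", 1.0, 5.0, 0.0, 0.0),
        ("d3", -3.0, 9.0, 0.0, 0.0),   # screened out before Eq. (5) is applied
    ]
    print(rank_with_screening(docs, keep=2))
```

Under this reading, a term with no occurrences in the learning set receives a weight of zero, since the smoothed ratio then reduces to the prior odds n(R)/n(not R). A larger p pulls sparse counts more strongly toward that neutral value, which is one way to see why the choice of p might be worth optimizing experimentally.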