to which each of the two kinds of relativization might best
be introduced. To investigate the first kind of relativiza-
tion -- relativization with respect to query or document
length -- a semi-relativized stem frequency variable of the
form
$$\frac{\text{absolute frequency}}{\text{document length} + C}$$
was adopted. If the constant C is chosen to be zero, one
has full relativization, whereas a value for C much greater
than the document length will cause the variable to behave
in the regression analysis as though it were an absolute
frequency. Several logistic regressions were run to dis-
cover by trial-and-error the value of C that produced the
best regression fit to the TIPSTER learning data and the
highest precision and recall. It was found that a value of
around C = 35 for queries, and C = 80 for documents,
optimized performance.
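For concreteness, a minimal Python sketch of this semi-relativized variable follows; the function name and calling convention are illustrative assumptions, while the two constants are the values reported above.

def semi_relativized_freq(stem_count, text_length, c):
    """Semi-relativized stem frequency: absolute count / (text length + C).

    C = 0 gives full relativization with respect to length, while a C much
    larger than typical lengths makes the variable behave like an absolute
    frequency in the regression.
    """
    return stem_count / (text_length + c)

C_QUERY = 35   # value found best for query stem frequencies
C_DOC = 80     # value found best for document stem frequencies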
For the relativization with respect to the global rela-
tive frequency in the collection, the relativizing effect of
the denominator was weakened by a slightly different
method -- by raising it to a power less than 1.0. For the
document frequencies, for instance, a variable of form
$$\frac{\;\dfrac{\text{absolute frequency in doc}}{\text{document length} + 80}\;}{\left(\text{relative frequency in collection}\right)^{D}}$$
with D < 1 was in effect used. Actually, it was the loga-
rithm of this expression that was ultimately adopted,
which allowed the variable to be broken up into a differ-
ence of two logarithmic expressions. The optimal value
of the power D was therefore obtainable in a single
regression as the coefficient of the logged denominator.
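To make the decomposition explicit, write $f_d$ for the match stem's absolute frequency in the document, $\ell_d$ for the document length, and $p_c$ for the stem's relative frequency in the collection (notation introduced here for convenience only). Then

$$\log \frac{f_d/(\ell_d + 80)}{p_c^{\,D}} \;=\; \log \frac{f_d}{\ell_d + 80} \;-\; D \log p_c ,$$

so the two logarithmic quantities can enter the regression as separate variables, and $D$ is recovered, up to sign, as the fitted coefficient of the logged denominator $\log p_c$.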
Thus the variables ultimately adopted consisted in
essence of sums over two optimally relativized frequen-
cies ('ORF's) -- one for match stem frequency in the
query and one for match stem frequency in the document.
Because of the logging this breaks up mathematically into
three variables. A final logistic regression using sums
over these variables as indicated in Eq. (2) produced the
ranking equation
$$\log O(R \mid A_1, \ldots, A_M) \;=\; -3.51 \;+\; \frac{1}{\sqrt{M+1}}\,\Phi \;+\; 0.0929\,M \qquad (3)$$
where $\Phi$ is the expression
$$37.4 \sum_{m=1}^{M} X_{m,1} \;+\; 0.330 \sum_{m=1}^{M} X_{m,2} \;-\; 0.1937 \sum_{m=1}^{M} X_{m,3}\,.$$
Here
$X_{m,1}$ = number of times the m'th stem occurs in the query, divided by (total number of all stem occurrences in query + 35);
$X_{m,2}$ = number of times the m'th stem occurs in the document, divided by (total number of all stem occurrences in document + 80), quotient logged;
$X_{m,3}$ = number of times the m'th stem occurs in the collection, divided by the total number of all stem occurrences in the collection, quotient logged;
$M$ = number of distinct stems common to both query and document.
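A minimal Python sketch of a scorer based on Eq. (3) is given below; the dictionary-based interface, function name, and argument names are illustrative assumptions rather than the system's actual implementation.

import math

def eq3_log_odds(query_counts, doc_counts, coll_counts,
                 query_len, doc_len, coll_len):
    """Estimate log O(R | A_1, ..., A_M) for one query-document pair, Eq. (3).

    query_counts, doc_counts, coll_counts map stem -> occurrence count in the
    query, the document, and the whole collection; query_len, doc_len, and
    coll_len are the corresponding totals of all stem occurrences.
    """
    match_stems = set(query_counts) & set(doc_counts)
    M = len(match_stems)
    if M == 0:
        return -3.51  # no match stems: the sums vanish, only the intercept remains

    sum_x1 = sum(query_counts[s] / (query_len + 35) for s in match_stems)
    sum_x2 = sum(math.log(doc_counts[s] / (doc_len + 80)) for s in match_stems)
    sum_x3 = sum(math.log(coll_counts[s] / coll_len) for s in match_stems)

    phi = 37.4 * sum_x1 + 0.330 * sum_x2 - 0.1937 * sum_x3
    return -3.51 + phi / math.sqrt(M + 1) + 0.0929 * M

Candidate documents would then be ranked for a query in decreasing order of this value.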
Although Eq. (2) calls for an $M^2$ term as well, such a term
was found not to make a statistically significant contribu-
tion and so was eliminated.
Eq. (3) provided the ranking rule used in the ad hoc
run (labeled 'Brkly3') and the first routing run ('Brkly4')
submitted for TREC-2. The equation is notable for the spar-
sity of the information it uses. Essentially it exploits only
two ORF values for each match stem, one for the stem's
frequency in the query and the other for its frequency in
the document. Other variables were tried including the
inverse document frequency (both logged and unlogged),
a variable consisting of a count of all two-stem phrase
matches, and several variables for measuring the tendency
of the match stems to bunch together in the document. All
of these exhibited predictive power when used in isola-
tion, but were discarded because in the presence of the
ORF's none produced any detectable improvement in the
regression fit or the precision/recall performance. Some
attempts at query expansion using the WordNet thesaurus
also failed to produce noticeable improvement, even when
care was taken to create separate variables with separate
coefficients for the synonym-match counts as opposed to
the exact-match counts.
The quality of a retrieval output ranking matters
most near the top of the ranking where it is likely to affect
the most users, and matters hardly at all far down the
ranking where hardly any users are apt to search. Because
of this it is desirable to adopt a sampling methodology that
produces an especially good regression fit to the sample