to which each of the two kinds of relativization might best
be introduced. To investigate the first kind of relativiza-
tion -- relativization with respect to query or document
length -- a semi-relativized stem frequency variable of the
form
$$\frac{\text{absolute frequency}}{\text{document length} + C}$$
was adopted. If the constant C is chosen to be zero, one
has full relativization, whereas a value for C much greater
than the document length will cause the variable to behave
in the regression analysis as though it were an absolute
frequency. Several logistic regressions were run to dis-
cover by trial-and-error the value of C that produced the
best regression fit to the TIPSTER learning data and the
highest precision and recall. It was found that a value of
around C = 35 for queries, and C = 80 for documents,
optimized performance.
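For concreteness, a minimal Python sketch of this semi-relativized variable follows; the function name and calling convention are illustrative assumptions, while the two constants are the values reported above.

def semi_relativized_freq(stem_count, text_length, c):
    """Semi-relativized stem frequency: absolute count / (text length + C).

    C = 0 gives full relativization with respect to length, while a C much
    larger than typical lengths makes the variable behave like an absolute
    frequency in the regression.
    """
    return stem_count / (text_length + c)

C_QUERY = 35   # value found best for query stem frequencies
C_DOC = 80     # value found best for document stem frequencies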
For the relativization with respect to the global rela-
tive frequency in the collection, the relativizing effect of
the denominator was weakened by a slightly different
method -- by raising it to a power less than 1.0. For the
document frequencies, for instance, a variable of form
$$\frac{\;\dfrac{\text{absolute frequency in doc}}{\text{document length} + 80}\;}{\left(\text{relative frequency in collection}\right)^{D}}$$
with D < 1 was in effect used. Actually, it was the loga-
rithm of this expression that was ultimately adopted,
which allowed the variable to be broken up into a differ-
ence of two logarithmic expressions. The optimal value
of the power D was therefore obtainable in a single
regression as the coefficient of the logged denominator.
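To make the decomposition explicit, write $f_d$ for the match stem's absolute frequency in the document, $\ell_d$ for the document length, and $p_c$ for the stem's relative frequency in the collection (notation introduced here for convenience only). Then

$$\log \frac{f_d/(\ell_d + 80)}{p_c^{\,D}} \;=\; \log \frac{f_d}{\ell_d + 80} \;-\; D \log p_c ,$$

so the two logarithmic quantities can enter the regression as separate variables, and $D$ is recovered, up to sign, as the fitted coefficient of the logged denominator $\log p_c$.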
Thus the variables ultimately adopted consisted in
essence of sums over two optimally relativized frequen-
cies ('ORF's) -- one for match stem frequency in the
query and one for match stem frequency in the document.
Because of the logging this breaks up mathematically into
three variables. A final logistic regression using sums
over these variables as indicated in Eq. (2) produced the
ranking equation
$$\log O(R \mid A_1, \ldots, A_M) \;=\; -3.51 \;+\; \frac{1}{\sqrt{M+1}}\,\Phi \;+\; 0.0929\,M \qquad (3)$$
where $\Phi$ is the expression
$$37.4 \sum_{m=1}^{M} X_{m,1} \;+\; 0.330 \sum_{m=1}^{M} X_{m,2} \;-\; 0.1937 \sum_{m=1}^{M} X_{m,3}\,.$$
Here
$X_{m,1}$ = number of times the m'th stem occurs in the query, divided by (total number of all stem occurrences in query + 35);
$X_{m,2}$ = number of times the m'th stem occurs in the document, divided by (total number of all stem occurrences in document + 80), quotient logged;
$X_{m,3}$ = number of times the m'th stem occurs in the collection, divided by the total number of all stem occurrences in the collection, quotient logged;
$M$ = number of distinct stems common to both query and document.
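A minimal Python sketch of a scorer based on Eq. (3) is given below; the dictionary-based interface, function name, and argument names are illustrative assumptions rather than the system's actual implementation.

import math

def eq3_log_odds(query_counts, doc_counts, coll_counts,
                 query_len, doc_len, coll_len):
    """Estimate log O(R | A_1, ..., A_M) for one query-document pair, Eq. (3).

    query_counts, doc_counts, coll_counts map stem -> occurrence count in the
    query, the document, and the whole collection; query_len, doc_len, and
    coll_len are the corresponding totals of all stem occurrences.
    """
    match_stems = set(query_counts) & set(doc_counts)
    M = len(match_stems)
    if M == 0:
        return -3.51  # no match stems: the sums vanish, only the intercept remains

    sum_x1 = sum(query_counts[s] / (query_len + 35) for s in match_stems)
    sum_x2 = sum(math.log(doc_counts[s] / (doc_len + 80)) for s in match_stems)
    sum_x3 = sum(math.log(coll_counts[s] / coll_len) for s in match_stems)

    phi = 37.4 * sum_x1 + 0.330 * sum_x2 - 0.1937 * sum_x3
    return -3.51 + phi / math.sqrt(M + 1) + 0.0929 * M

Candidate documents would then be ranked for a query in decreasing order of this value.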
Although Eq. (2) calls for an $M^2$ term as well, such a term
was found not to make a statistically significant contribu-
tion and so was eliminated.
Eq. (3) provided the ranking rule used in the ad hoc
run (labeled 'Brkly3') and the first routing run ('Brkly4')
submitted for TREC-2. The equation is notable for the spar-
sity of the information it uses. Essentially it exploits only
two ORF values for each match stem, one for the stem's
frequency in the query and the other for its frequency in
the document. Other variables were tried including the
inverse document frequency (both logged and unlogged),
a variable consisting of a count of all two-stem phrase
matches, and several variables for measuring the tendency
of the match stems to bunch together in the document. All
of these exhibited predictive power when used in isola-
tion, but were discarded because in the presence of the
ORF's none produced any detectable improvement in the
regression fit or the precision/recall performance. Some
attempts at query expansion using the WordNet thesaurus
also failed to produce noticeable improvement, even when
care was taken to create separate variables with separate
coefficients for the synonym-match counts as opposed to
the exact-match counts.
The quality of a retrieval output ranking matters
most near the top of the ranking where it is likely to affect
the most users, and matters hardly at all far down the
ranking where hardly any users are apt to search. Because
of this it is desirable to adopt a sampling methodology that
produces an especially good regression fit to the sample