NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
W. Cooper, F. Grey, A. Chen
National Institute of Standards and Technology
Donna K. Harman

called a `logit', is $\log O(E)$, or in the case of conditional odds, $\log O(E_1 \mid E_2)$. Natural logarithms will be used throughout the sequel.

The first equation estimates the logodds that a document is relevant to a query, given just one composite clue relating query to document. For the TREC experiment a composite clue was defined to be a set of six frequency properties of a particular stem match, where a `stem match' is understood to be the event that some word stem has been found to occur at least once in both the query and the document.

Consider a composite clue $A_i$ consisting of the six elementary properties $X_1, \ldots, X_6$ that describe some stem match. Let $R$ denote the possible event that the document under consideration is in fact relevant to the query under consideration. The first equation, derived by logistic regression from a learning sample for the Wall Street Journal (WSJ) collection, is

\[
\log O(R \mid A_i) = \log O(R \mid X_1, X_2, X_3, X_4, X_5, X_6) = -7.08 + .38 X_1 + .04 X_2 + .77 X_3 \;[\text{sign illegible}]\; .07 X_4 + 1.05 X_5 + .23 X_6 \tag{1}
\]

where

$X_1$ = the log of the absolute frequency of occurrence of the stem in the query; i.e. a simple count of its occurrences in the query, logged.
$X_2$ = the log of the relative frequency of occurrence of the stem in the query; i.e. the log of the (unlogged) absolute frequency divided by the query length, with query length defined as the total number of occurrences of all word stems in the query.
$X_3$ = the log of the absolute frequency of occurrence of the stem in the document.
$X_4$ = the log of the relative frequency of occurrence of the stem in the document.
$X_5$ = the log of the inverse document frequency of the stem in the collection; i.e.
the proportion of documents containing at least one occurrence of the stem, inverted and logged.
$X_6$ = the log of the global relative frequency of the stem in the collection; i.e. the fraction of word occurrences in the entire collection that are occurrences of the stem in question, logged.

The equation says roughly that if a TIPSTER query and a WSJ document were chosen at random, and a certain stem were found to occur in both of them with frequency statistics $X_1, \ldots, X_6$, then in the absence of other knowledge it would be appropriate to apply this formula to estimate the logodds that the document is relevant to the query.

Next, suppose a query and document have $N$ terms in common, leading via Eq. (1) to $N$ different logodds estimates, one for each of the term matches $A_1$ through $A_N$. The second equation allows these estimates to be combined into a preliminary estimate of the logodds that the document is relevant to the query. It has the form

\[
\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right] \tag{2}
\]

The only quantity on the right side that cannot be estimated using Eq. (1) is the prior logodds $\log O(R)$. The value -6.725 was used for this parameter.
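The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function and argument names (`clue_features`, `clue_logodds`, `combined_logodds`, `qtf`, `dlen`, and so on) are hypothetical, and the sign of the $X_4$ coefficient, which is not legible in Eq. (1) as reproduced here, is assumed to be negative. The remaining constants are those stated in the text.

```python
import math

# Constants of Eq. (1) for the WSJ collection, as given in the text.
# ASSUMPTION: the sign of the X4 coefficient is illegible in the source;
# a negative sign is assumed here purely for illustration.
INTERCEPT = -7.08
COEFFS = (0.38, 0.04, 0.77, -0.07, 1.05, 0.23)
PRIOR_LOGODDS = -6.725  # prior logodds log O(R) used with Eq. (2)


def clue_features(qtf, qlen, dtf, dlen, doc_freq, n_docs, coll_tf, coll_len):
    """Return (X1, ..., X6) for one stem match, using natural logarithms.

    qtf, dtf   : occurrences of the stem in the query / document
    qlen, dlen : total stem occurrences in the query / document
    doc_freq   : number of documents containing the stem
    n_docs     : number of documents in the collection
    coll_tf    : occurrences of the stem in the whole collection
    coll_len   : total word occurrences in the whole collection
    """
    return (
        math.log(qtf),                 # X1: log absolute frequency in query
        math.log(qtf / qlen),          # X2: log relative frequency in query
        math.log(dtf),                 # X3: log absolute frequency in document
        math.log(dtf / dlen),          # X4: log relative frequency in document
        math.log(n_docs / doc_freq),   # X5: log inverse document frequency
        math.log(coll_tf / coll_len),  # X6: log global relative frequency
    )


def clue_logodds(x, intercept=INTERCEPT, coeffs=COEFFS):
    """Eq. (1): logodds of relevance given one composite clue."""
    return intercept + sum(c * xi for c, xi in zip(coeffs, x))


def combined_logodds(clue_estimates, prior=PRIOR_LOGODDS):
    """Eq. (2): prior logodds plus each clue's deviation from the prior."""
    return prior + sum(est - prior for est in clue_estimates)


def logodds_to_probability(logodds):
    """Invert the logit: p = 1 / (1 + exp(-logodds))."""
    return 1.0 / (1.0 + math.exp(-logodds))
```

With no matching stems, `combined_logodds([])` returns the prior unchanged, and with a single match it reduces to that clue's own estimate from Eq. (1). The coefficients are passed as default arguments so that a different sign for the $X_4$ term, or coefficients fitted to another collection, can be substituted without changing the functions.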