...ntc and ltc formulas. However, a possible explanation could be the fact that the regression method tries to minimize the quadratic error for all the documents in the learning sample, whereas our evaluation measure considers at most the top-ranking 1000 documents for each query; so regression might perform well for most of the documents from the database, but not for the top of the ranking list. There is some indication for this explanation, since regression always yields slightly better results at the high-recall end.

  α      result
  0.00   0.3199
  0.10   0.3707
  0.15   0.3734
  0.20   0.3700
  0.25   0.3656
  0.30   0.3610
  0.50   0.3451
  1.00   0.3147

Table 4: Effect of downweighting of phrases (sample Q2/D12)

As described before, in our indexing process we consider phrases in addition to single words. This leads to the problem that when a phrase occurs in a document, we index the phrase in addition to the two single words forming the phrase. As a heuristic method for overcoming this problem, we introduced a factor for downweighting query term weights of phrases. That is, the actual query term weight of a phrase is c'_ik = α · c_ik, where c_ik is the result of the regression process. In order to derive a value for α, we performed a number of test runs with varying values (see table 4). Obviously, weighting factors between 0.1 and 0.3 gave the best results. For the official runs, we chose α = 0.15.

  QTW   α      sample Q1/D12   sample Q2/D3
  ltc   0.15   0.3192          0.3131
  ltc   0.2    0.3220          0.3056
  reg   0.15   0.3080          0.3062

Table 5: Results for single words and phrases

In table 5, this method is compared with the ltc formula, where we also chose the weighting factor for phrases that gave the best results. One can see that with the sample Q2/D3, the differences between the methods are smaller than on sample Q1/D12, but ltc still seems to perform slightly better.

Finally, we investigated another method for coping with phrases. For that, let us assume that we have binary query term weights only. Now as an example, the single words t1 and t2 form a phrase t3. For a query with phrase t3 (and thus also with t1 and t2), a document dm containing the phrase would yield u_1m + u_2m + u_3m as the value of the retrieval function, where the weights u_im are computed by the LSP method described before. In order to avoid the effect of counting the single words in addition to the phrase, we modified the original phrase weight as follows: u'_3m = u_3m - u_1m - u_2m, and stored this value as the phrase weight. Queries with the single words t1 or t2 are not affected by this modification. For the query with phrase t3, however, the retrieval function now yields the value u_1m + u_2m + u'_3m = u_3m, which is what we would like to get (a small sketch illustrating this follows below).

  QTW   α      result
  reg   0.00   0.2724
  reg   1.00   0.2596
  ntc   0.00   0.2754
  ntc   0.15   0.3110
  ntc   1.00   0.2524

Table 6: Results for the subtraction method (sample Q1/D12)

Table 6 shows the corresponding results (α = 0 means that only single words are considered). In contrast to what we expected, we do not get an improvement over single words alone when phrases are considered fully. The result for the ntc method shows that phrases should still be downweighted.
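To make the subtraction method concrete, the following minimal Python sketch (not the authors' code; the function names and toy weights are assumptions for illustration) shows that, under a linear retrieval function with binary query term weights, storing u'_3m = u_3m - u_1m - u_2m as the phrase weight lets a query containing the phrase t3 score exactly u_3m, while queries with the single words alone are unaffected.

    # Sketch of the subtraction method for phrase weights (illustrative only).

    def adjusted_phrase_weight(u_phrase, u_word1, u_word2):
        # Store u'_3m = u_3m - u_1m - u_2m as the indexing weight of the phrase.
        return u_phrase - u_word1 - u_word2

    def retrieval_value(query_terms, doc_weights):
        # Linear retrieval function with binary query term weights:
        # the sum of the document indexing weights of the query terms.
        return sum(doc_weights.get(t, 0.0) for t in query_terms)

    # Toy indexing weights of document d_m for the words t1, t2 and the phrase t3.
    u1m, u2m, u3m = 0.4, 0.3, 0.6
    doc_weights = {
        "t1": u1m,
        "t2": u2m,
        "t3": adjusted_phrase_weight(u3m, u1m, u2m),  # u'_3m = -0.1
    }

    # A query with the phrase t3 also contains t1 and t2, so it scores
    # u1m + u2m + u'_3m = u3m = 0.6, as intended.
    print(retrieval_value(["t1", "t2", "t3"], doc_weights))
    # A query with a single word only is unaffected: 0.4
    print(retrieval_value(["t1"], doc_weights))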
Possibly, there may be an improvement with this method when binary query term weights are used, but it is clear that other query term weighting methods mostly give better results.

3.3 Official runs

As document indexing method, we applied the description-oriented approach described in section 2. In order to estimate the coefficients of the indexing function, we used the training sample Q12/D12, i.e. the query sets Q1 and Q2 in combination with the documents from D1 and D2.

Two runs with different query term weights were submitted. Run dortL2 is based on the nnn method, i.e. tf weights. Run dortQ2 uses reg query term weights. For performing the regression, we used the query sets Q1 and Q2 and a sample of 400,000 documents from D1.

Table 7 shows the results for the two runs (numbers in parentheses denote figures close to the best/worst results). As expected, dortQ2 yields better results than dortL2. The recall-precision curves (see figure 1) show that there is an improvement throughout the whole recall range. For the precision average and the precision at 1000 documents retrieved, run dortQ2 performs very well, while the precision at 100 documents retrieved is less good. This confirms our interpretation from above, saying that regression performs well for most of the documents, but not necessarily for the top of the ranking list.
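As a rough illustration of the regression step referred to here and in the discussion above (this is not the authors' implementation; the features and numbers below are invented for illustration), the coefficients of a linear scoring function can be estimated by least squares, i.e. by minimizing the quadratic error over all pairs in the learning sample rather than only over the top-ranked documents:

    # Hedged sketch of a least-squares fit of scoring-function coefficients.
    import numpy as np

    # Each row describes one (query term, document) pair by a few features,
    # e.g. within-document tf, an idf-like value, and a phrase indicator.
    X = np.array([
        [3.0, 1.2, 0.0],
        [1.0, 0.4, 1.0],
        [0.0, 0.9, 0.0],
        [2.0, 1.5, 1.0],
    ])
    # Relevance judgements for these pairs (1 = relevant, 0 = non-relevant).
    y = np.array([1.0, 1.0, 0.0, 1.0])

    # Least-squares estimate: minimizes ||X c - y||^2 over the whole sample,
    # which is the quadratic-error criterion discussed above.
    coeff, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The fitted coefficients map feature vectors to term weights.
    print(coeff)
    print(X @ coeff)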