NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection
N. Fuhr
U. Pfeifer
C. Bremkamp
M. Pollmann
Edited by D. K. Harman, National Institute of Standards and Technology
ntc and ltc formulas. However, a possible explanation could be the fact that the regression method tries to minimize the quadratic error for all documents in the learning sample, whereas our evaluation measure considers at most the top-ranking 1000 documents for each query; so regression might perform well for most of the documents in the database, but not for the top of the ranking list. There is some indication for this explanation, since regression always yields slightly better results at the high-recall end.
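To make this mismatch explicit, here is a minimal restatement in formula form; the symbols (learning sample L, retrieval function value ρ(q_k, d_m), relevance value r_km) are our own shorthand rather than the paper's notation. The regression step minimizes the quadratic error

\[
  \sum_{(q_k,\, d_m) \in L} \bigl( r_{km} - \varrho(q_k, d_m) \bigr)^{2}
\]

over all query-document pairs in the learning sample, whereas the recall-precision figures are computed only from the 1000 top-ranked documents of each query; pairs outside these ranks influence the fitted weights but never the evaluation.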
As described before, in our indexing process, we consider phrases in addition to single words. This leads to the problem that when a phrase occurs in a document, we index the phrase in addition to the two single words forming the phrase. As a heuristic method for overcoming this problem, we introduced a factor for downweighting the query term weights of phrases. That is, the actual query term weight of a phrase is c'ik = α · cik, where cik is the result of the regression process. In order to derive a value for α, we performed a number of test runs with varying values (see table 4). Obviously, weighting factors between 0.1 and 0.3 gave the best results. For the official runs, we chose α = 0.15.

α      result
0.00   0.3199
0.10   0.3707
0.15   0.3734
0.20   0.3700
0.25   0.3656
0.30   0.3610
0.50   0.3451
1.00   0.3147

Table 4: Effect of downweighting of phrases (sample Q2/D12)
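As an illustration of this downweighting step, the following short Python sketch applies a factor α to the query term weights of phrases before scoring a document; the retrieval function (sum of query term weight times document indexing weight) and all names (retrieval_score, phrase_terms, query_weights, doc_weights) are our own assumptions for the example, not the original implementation.

# Hypothetical sketch: downweighting phrase query term weights by a factor alpha
# (c'ik = alpha * cik); a simple linear retrieval function is assumed for illustration.
def retrieval_score(query_weights, doc_weights, phrase_terms, alpha=0.15):
    score = 0.0
    for term, c_ik in query_weights.items():
        if term in phrase_terms:          # only phrase terms are downweighted
            c_ik = alpha * c_ik
        score += c_ik * doc_weights.get(term, 0.0)
    return score

# Example: a query with two single words and the phrase formed from them
query_weights = {"stock": 0.8, "market": 0.7, "stock market": 0.9}
doc_weights = {"stock": 0.5, "market": 0.4, "stock market": 0.6}
print(retrieval_score(query_weights, doc_weights, phrase_terms={"stock market"}))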
In table 5, this method is compared with the ltc formula, where we also chose the weighting factor for phrases that gave the best results. One can see that with the sample Q2/D3, the differences between the methods are smaller than on sample Q1/D12, but still ltc seems to perform slightly better.

                    sample
QTW    α       Q1/D12    Q2/D3
ltc    0.15    0.3192    0.3131
ltc    0.2     0.3220    0.3056
reg    0.15    0.3080    0.3062

Table 5: Results for single words and phrases
Finally, we investigated another method for coping with phrases. For that, let us assume that we have binary query weights only. Now, as an example, the single words t1 and t2 form a phrase t3. For a query with phrase t3 (and thus also with t1 and t2), a document dm containing the phrase would yield u1m + u2m + u3m as the value of the retrieval function, where the weights uim are computed by the LSP method described before. In order to avoid the effect of counting the single words in addition to the phrase, we modified the original phrase weight as follows:

u'3m = u3m - u1m - u2m

and stored this value as the phrase weight. Queries with the single words t1 or t2 are not affected by this modification. For the query with phrase t3, however, the retrieval function now yields the value u1m + u2m + u'3m = u3m, which is what we would like to get.
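The following minimal Python sketch only restates this subtraction method for binary query weights; the names (stored_phrase_weight, score, doc_weights) and the example numbers are hypothetical.

# Subtraction method for phrase weights, binary query weights assumed:
# the value written to the index for the phrase is u'3m = u3m - u1m - u2m.
def stored_phrase_weight(u1m, u2m, u3m):
    return u3m - u1m - u2m

def score(query_terms, doc_weights):
    # retrieval function for binary query weights: sum of matching indexing weights
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

u1m, u2m, u3m = 0.3, 0.4, 0.6          # made-up indexing weights for t1, t2, t3
doc_weights = {"t1": u1m, "t2": u2m, "t1 t2": stored_phrase_weight(u1m, u2m, u3m)}

# A query containing the phrase also matches the single words,
# so the score is u1m + u2m + u'3m = u3m, as intended:
print(score(["t1", "t2", "t1 t2"], doc_weights))   # 0.6

# A query with only the single words is unaffected by the modification:
print(score(["t1", "t2"], doc_weights))            # 0.7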
Table 6 shows the corresponding results (α = 0 means that only single words are considered). In contrast to what we expected, we do not get an improvement over single words alone when phrases are considered fully. The result for the ntc method shows that phrases should still be downweighted. Possibly there might be an improvement with this method if we used binary query term weights, but it is clear that other query term weighting methods mostly give better results.
QTW    α       result
reg    0.00    0.2724
reg    1.00    0.2596
ntc    0.00    0.2754
ntc    0.15    0.3110
ntc    1.00    0.2524

Table 6: Results for the subtraction method (sample Q1/D12)
3.3 Official runs
As the document indexing method, we applied the description-oriented approach described in section 2. In order to estimate the coefficients of the indexing function, we used the training sample Q12/D12, i.e. the query sets Q1 and Q2 in combination with the documents from D1 and D2.
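As an illustration only, such a coefficient estimation could be sketched as a linear least-squares fit as below; the feature matrix X (one relevance description per term-document pair) and the relevance vector y are placeholders for the quantities defined in section 2, and the numbers are made up.

# Hypothetical sketch: estimating the coefficients of a linear indexing function
# by least squares from a training sample of term-document pairs.
import numpy as np

X = np.array([[1.0, 0.2, 3.0],      # each row: relevance description of one pair
              [1.0, 0.8, 1.0],
              [1.0, 0.5, 2.0],
              [1.0, 0.1, 4.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])  # corresponding relevance decisions

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficient estimate

x_new = np.array([1.0, 0.6, 1.5])   # relevance description of a new pair
print(float(coeffs @ x_new))        # indexing weight assigned by the fitted function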
Two runs with different query term weights were submitted. Run dortL2 is based on the nnn method, i.e. tf weights. Run dortQ2 uses reg query term weights. For performing the regression, we used the query sets Q1 and Q2 and a sample of 400,000 documents from D1.
Table 7 shows the results for the two runs (numbers in parentheses denote figures close to the best/worst results). As expected, dortQ2 yields better results than dortL2. The recall-precision curves (see figure 1) show that there is an improvement throughout the whole recall range. For precision average and precision at 1000 documents retrieved, run dortQ2 performs very well, while precision at 100 documents retrieved is weaker. This confirms our interpretation from above, namely that