Terms   Wts     AveP   P5     P30    P100   RP     Rcl
All     Final   0.407  1.000  0.867  0.640  0.457  0.739
Topic   Final   0.362  1.000  0.733  0.620  0.440  0.698
Topic   Orig    0.373  0.600  0.800  0.700  0.433  0.680

Discussion

The main motive for experimenting with this type of query expansion is that it is one way of finding terms which are in some sense closely associated with the query as a whole. It does not fit particularly well with the Robertson/Sparck Jones type of probabilistic theory [5], the validity of which depends on pairwise independence of terms in both relevant and nonrelevant documents. However, it is clear, if only from the results in this paper, that mutual dependence does not necessarily lead to poor results.

There are many variables involved. In our rather limited experiments most of the initial feedback searches were done under the conditions of the first row of Table 2, that is with terms from title, concepts, narrative and description (there were a few runs using title and concepts only, but the results for most topics were not good); and weighting function BM11 with term weights given by equation 6 with large k3 (1000). This gave nearly the best precision at 5 and 30 documents of any of our results. The number of feedback documents was constant across topics and was varied between 10 and 50. For the final search, terms were always weighted with BM11, but several values of k3 were tried (including zero). Some runs used topic terms only and some used expansion terms as well. There was one run omitting narrative and description terms from the final search, but it was not among the very best and is not reported in the table. The number of terms in the final search was varied from 10 upwards, terms being selected as usual in descending order of termweight × r. Some evaluations were done using frozen ranks, in case the initial searches tended to give better low precision, but this turned out not to be the case.

A few of the results are summarised in Table 6. They include results which appear better than the best otherwise obtained, but the difference is small, and these runs have not yet been repeated on the other topic sets. A qtf weight component is still needed (compare rows 2 and 14 of the table). The number of feedback documents is not critical. Speeding up searching by using only the top 10 or 20 terms is detrimental.

It is interesting that results do not seem to be very greatly affected by the precision of the feedback set. Looking at the individual topics in the run represented by the top row of Table 6, 25 did better than in the feedback run, 18 did worse and the remainder about the same. Restricting to the 20 topics where the precision at 30 in the feedback set was below 0.5, the corresponding figures are 7, 10 and 3.
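To make the selection step concrete, the following is a minimal sketch, not code from the Okapi system: it assumes the Robertson/Sparck Jones relevance weight of [5] as the term weight and orders candidate expansion terms by termweight × r, as described above. Function names, data structures and the top_terms cut-off are illustrative only.

import math
from collections import Counter

def rsj_weight(r, n, R, N):
    # Robertson/Sparck Jones relevance weight with the usual 0.5 corrections.
    # r: feedback documents containing the term, n: collection document frequency,
    # R: number of feedback documents, N: collection size.
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def expansion_terms(feedback_docs, doc_freq, N, top_terms=40):
    # Rank candidate expansion terms from the top R feedback documents,
    # selecting in descending order of termweight * r.
    R = len(feedback_docs)
    r_counts = Counter()
    for doc in feedback_docs:
        r_counts.update(set(doc))        # r = number of feedback docs containing the term
    scored = []
    for term, r in r_counts.items():
        n = doc_freq.get(term, r)        # collection document frequency
        w = rsj_weight(r, n, R, N)
        scored.append((w * r, w, term))  # selection value: termweight * r
    scored.sort(reverse=True)
    return [(term, w) for _, w, term in scored[:top_terms]]

Varying R (the number of feedback documents fed in) and the number of retained terms corresponds to the experimental variables discussed above.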
6.2 Stemming

A comparison was made on the AP database between the normal Okapi stemming, which removes many suffixes, and a "weak" stemming procedure which only conflates singular and plural forms and removes "-ing" endings. For some weighting functions weak stemming increased precision by about 2% and decreased recall by about 1%, but the observed difference is unlikely to be significant.

6.3 Stoplists

Some runs were done on the AP database to investigate the effect of stoplist size. A small stoplist consisted of the 17 words a, the, an, at, by, into, on, for, from, to, with, of, and, or, in, not, et, and a large one contained 209 articles, conjunctions, prepositions, pronouns and verbs. There was no significant difference in the results of the runs, but the index size was about 25% greater with the small stoplist.

7 Conclusions and prospects

7.1 The new probabilistic models

The most significant result is perhaps the great improvement in the automatic results brought about by the new term weighting models. In the ad-hoc runs, with no qtf component, BM15 is 14% better than BM1 on average precision and about 9% better on high precision and recall. The corresponding figures for BM11 are 51% and 34% (Table 3). For the routing runs, where a considerable amount of relevance information had contributed to the term weights, the improvement is less, but still very significant (Table 4). For the manual feedback searches (Table 5) there was a small improvement when they were re-run with BM11 replacing BM15 in the final iteration.

The drawback of these two models is that the theory says nothing about the estimation of the constants, or rather parameters, k1 and k2. It may be assumed that these depend on the database, and probably also on the nature of the queries and on the amount of relevance information available. We do not know how sensitive they are to any of these factors. Estimation cannot be done without sets of queries and relevance judgments, and even then, since the models are not linear, they do not lend themselves to estimation by logistic regression. The values we used were arrived at by long sequences of trials, mainly using topics 51-100 on the disks 1 and 2 database, with the TREC-1 relevance sets.
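As a rough illustration of this tuning problem, the sketch below shows the general shape of a BM11-style weight and a trial loop over candidate k1 values. Since equation 6 is not reproduced in this excerpt, the document-length normalisation and the k3 query-term-frequency factor are assumptions based on the description above, not the exact Okapi formula, and run_search and average_precision stand in for the retrieval and evaluation machinery.

def bm11_like_weight(w1, tf, qtf, dl, avdl, k1=1.0, k3=1000.0):
    # Assumed shape only: the relevance weight w1 is scaled by a saturating
    # tf component normalised by relative document length (dl/avdl), and by a
    # query-term-frequency component controlled by k3 (a large k3 makes the
    # weight roughly proportional to qtf).  This is not equation 6 itself.
    tf_part = tf / (k1 * dl / avdl + tf)
    qtf_part = qtf / (k3 + qtf)
    return w1 * tf_part * qtf_part

def tune_k1(candidate_k1s, topics, run_search, average_precision):
    # Crude stand-in for the "long sequences of trials": run each candidate
    # value over the training topics (e.g. topics 51-100 with the TREC-1
    # relevance sets) and keep the value with the best mean average precision.
    best = None
    for k1 in candidate_k1s:
        score = sum(average_precision(run_search(t, k1=k1), t.relevant)
                    for t in topics) / len(topics)
        if best is None or score > best[1]:
            best = (k1, score)
    return best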