70:19:desc:2:judici:proceed:1
70:19:desc:2:opinion:contract:1
70:19:desc:2:proceed:opinion:1
70:19:tit:2:surrog:motherhood:2

where the fields are topic number, topic length (number of terms, counting repeats but not pairs), source field (in precedence order TITLE > CONCEPTS > NARRATIVE > DESCRIPTION > DEFINITIONS), number of terms, term ..., and the frequency of this term or pair in the topic.

4.1.2 Document and query term weighting

Table 1 shows the effect of varying query term source fields when no account is taken of within-query term frequency. Some tentative conclusions can be drawn: adding TITLE to CONCEPTS improves most measures slightly; TITLE alone works well in a surprising proportion of the topics; the DESCRIPTION field is fairly harmless used in conjunction with CONCEPTS, but NARRATIVE and DEFINITIONS are detrimental. (The TIME and NATIONALITY fields, which are occasionally present, were never used.) This really only confirms what may be evident to a human searcher: CONCEPTS consists of search terms, but most of the other fields apart from TITLE are instructions and guidance to relevance assessors. A sentence such as "To be relevant, a document must identify the case, state the issues which are or were being decided and report at least one ethical or legal question which arises from the case." (from the NARRATIVE field of topic 70) can only contribute noise.

However, when a within-query term frequency (qtf) component is used in the term weighting, the information about the relative importance of terms gained from using all or most of the topic fields seems to outweigh the detrimental effect of noisy terms such as "identify", "state", "issues" and "question". Some results are summarised in Table 2. A number of values of k3 were tried in equation 6, and a large value proved best overall, giving the limiting case (equation 7), in which the term weight is simply multiplied by qtf (see the first sketch at the end of this subsection).

Many combinations of the weighting functions discussed in Section 1.1, as well as others not described here, were first tested on the AP and/or WSJ databases. Some of them were eliminated immediately. The function defined as BM15 gave almost uniformly better results than w(1), after suitable values for the constants had been found. BM11 appeared slightly less good than BM15 on the small databases, but later runs on the large databases showed that, with suitable choice of constants, it was substantially, though not uniformly, better. This may be a consequence of the greater variation in document lengths found in the large databases (see the second sketch at the end of this subsection). Table 3 compares the more elaborate term weighting functions with the standard w(1) weighting and with a baseline coordination level run.

Some work was done on the addition of adjacent pairs of topic terms to the queries (see Section 2.5). A number of runs were done, using several different ways of adjusting the "natural" weights of adjacent pairs. There was little difference between them, and the results are at best only slightly better than those from single terms alone (Table 3). There was also little difference between using all adjacent pairs and using only those pairs which derive from the same sentence of the topic, with no intervening punctuation.
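First sketch: the limiting case described by equation 7 can be illustrated as follows. This is a minimal illustration which assumes a saturating within-query frequency component of the form qtf(k3 + 1)/(k3 + qtf); the function name and the example constants are ours and are not taken from the runs reported above.

    def qtf_factor(qtf, k3):
        # Assumed saturating within-query frequency component.
        # As k3 grows large the factor tends to qtf itself, i.e. the
        # limiting case in which the term weight is simply multiplied
        # by qtf.
        return qtf * (k3 + 1) / (k3 + qtf)

    w, qtf = 2.5, 3   # hypothetical term weight and within-query frequency
    for k3 in (1, 8, 1000):
        print(k3, w * qtf_factor(qtf, k3))
    # k3 = 1 gives 3.75, k3 = 8 gives about 6.14, k3 = 1000 gives about
    # 7.49, approaching w * qtf = 7.5.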
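Second sketch: the behaviour that distinguishes BM11 from BM15 on collections with widely varying document lengths. This is a simplified reconstruction rather than the exact functions used in the runs: it assumes a within-document frequency component of the form tf/(K + tf), with K a constant for a BM15-like function and K scaled by document length relative to the average for a BM11-like function; correction terms and tuned constants are omitted.

    def tf_factor(tf, dl, avdl, k1=1.0, length_normalised=False):
        # Assumed within-document frequency component tf / (K + tf).
        # BM15-like: K is the constant k1.
        # BM11-like: K grows with document length dl relative to the
        # average length avdl, so matches in long documents count for less.
        K = k1 * (dl / avdl) if length_normalised else k1
        return tf / (K + tf)

    # A term occurring twice in a short and in a very long document
    # (hypothetical lengths, average length 500).
    for dl in (100, 2000):
        print(dl,
              round(tf_factor(2, dl, avdl=500, length_normalised=False), 2),
              round(tf_factor(2, dl, avdl=500, length_normalised=True), 2))
    # The BM15-like factor is the same for both documents; the BM11-like
    # factor penalises the long one.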
4.2 Routing

Potential query terms were obtained by "indexing" all the known relevant documents from disks 1 and 2; the topics themselves were not used (nor were known non-relevant documents). These terms were then given w(1) weights and selection values [11] given by r × w(1), where r and R are as in equation 1 (a small sketch of this selection procedure is given at the end of this section).

A large number of retrospective test runs were performed on the complete disks 1 and 2 database, in which the number of terms selected and the weighting function were the independent variables. Overall, there was little difference in average precision over the range 10-25 terms. This is consistent with the results reported by Harman in [10]. With regard to weighting functions, BM1 was slightly better than BM15. However, looking at individual queries, the optimal number of terms varied between three (several topics) and 31 (topic 89), with a median of 11; and BM15 was better than BM1 for 27 of the topics.

Two sets of official queries and results were produced. For the cityr1 run, the top 20 terms were selected for each topic and the weighting function was BM1. For cityr2 the test runs were sorted for each topic by precision at 30 documents within recall within average precision, and the "best" combination of number of terms and weighting function was chosen. When evaluated retrospectively against the full disks 1 and 2 database, the cityr2 queries were about 17% better on average precision and 10% better on recall than the cityr1 queries. The official results (first and second rows of Table 4) show a similar difference. Later, both sets of queries were repeated using BM11 instead of the previous weighting functions (third and fourth rows of the table). These final runs both show substantially better results than either of the official runs.
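The routing term selection can be sketched as follows. This is a minimal illustration: candidate terms taken from the known relevant documents are ranked by the selection value r × w(1), and the top terms (20 per topic in the cityr1 run) form the routing query. The w(1) formula is assumed to be the usual Robertson-Sparck Jones relevance weight with the 0.5 corrections, and all variable and function names are ours.

    import math

    def w1(r, n, R, N):
        # Assumed Robertson-Sparck Jones relevance weight:
        # r = known relevant documents containing the term,
        # n = documents containing the term,
        # R = known relevant documents, N = documents in the collection.
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    def select_routing_terms(term_stats, R, N, k=20):
        # term_stats maps term -> (r, n), obtained by "indexing" the
        # known relevant documents. Terms are ranked by the selection
        # value r * w(1); the top k become query terms weighted by w(1).
        scored = []
        for term, (r, n) in term_stats.items():
            weight = w1(r, n, R, N)
            scored.append((term, r * weight, weight))
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]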