Okapi at TREC-2
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford
In: D. K. Harman (ed.), The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, National Institute of Standards and Technology.
70:19:desc:2:judici:proceed:1
70:19:desc:2:opinion:contract:1
70:19:desc:2:proceed:opinion:1
70:19:tit:2:surrog:motherhood:2
where the fields are topic number, topic length (number
of terms counting repeats but not pairs), source field
(in precedence order TITLE > CONCEPTS > NAR-
RATIVE > DESCRIPTION > DEFINITIONS), num-
ber of terms, term ..., frequency of this term or pair in
the topic.
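To make the record layout concrete, the following is a minimal parsing sketch in Python. The colon-separated layout follows the description above; the class, function and field names (and the assumed source-field abbreviations other than "desc" and "tit") are hypothetical and not part of the Okapi system.

from dataclasses import dataclass

@dataclass
class TopicTermRecord:
    topic: int         # topic number (e.g. 70)
    topic_length: int  # number of terms, counting repeats but not pairs
    source: str        # source field, e.g. "tit" or "desc" (other abbreviations assumed)
    n_terms: int       # 1 for a single term, 2 for an adjacent pair
    terms: list[str]   # the stemmed term or pair of terms
    freq: int          # frequency of this term or pair in the topic

def parse_record(line: str) -> TopicTermRecord:
    """Parse one colon-separated record such as '70:19:tit:2:surrog:motherhood:2'."""
    fields = line.strip().split(":")
    topic, topic_length, source, n_terms = int(fields[0]), int(fields[1]), fields[2], int(fields[3])
    terms = fields[4:4 + n_terms]
    freq = int(fields[4 + n_terms])
    return TopicTermRecord(topic, topic_length, source, n_terms, terms, freq)

# Example: the last record above
print(parse_record("70:19:tit:2:surrog:motherhood:2"))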
4.1.2 Document and query term weighting
Table 1 shows the effect of varying query term source
fields when no account is taken of within-query term
frequency.
Some tentative conclusions can be drawn: adding TI-
TLE to CONCEPTS improves most measures slightly;
TITLE alone works well in a surprising proportion of
the topics; the DESCRIPTION field is fairly harmless
used in conjunction with CONCEPTS, but NARRA-
TIVE and DEFINITIONS are detrimental. (TIME and
NATIONALITY fields, which are occasionally present,
were never used.) This really only confirms what may
be evident to a human searcher: that CONCEPTS con-
sists of search terms, but most of the other fields apart
from TITLE are instructions and guidance to relevance
assessors. A sentence such as "To be relevant, a docu-
ment must identify the case, state the issues which are
or were being decided and report at least one ethical
or legal question which arises from the case." (from
the NARRATIVE field of topic 70) can only contribute
noise.
However, when a within-query term frequency (qtf)
component is used in the term weighting, the infor-
mation about the relative importance of terms gained
from the use of all or most of the topic fields seems to
outweigh the detrimental effect of noisy terms such as
"identify", "state", "issues", "question". Some results
are summarised in Table 2. A number of values of k3
were tried in equation 6, and a large value proved best
overall, giving the limiting case (equation 7), in which
the term weight is simply multiplied by qtf.
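Equations 6 and 7 themselves are not reproduced in this excerpt. As a sketch of the behaviour described, assuming an Okapi-style qtf component of the usual saturating form (the exact notation of equation 6 may differ; w here stands for the term weight before the qtf component):

    w' = w \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}, \qquad
    \lim_{k_3 \to \infty} \frac{(k_3 + 1)\,qtf}{k_3 + qtf} = qtf

so for large k3 the term weight is simply multiplied by qtf, which is the limiting case referred to as equation 7.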
Many combinations of the weighting functions dis-
cussed in Section 1.1, as well as others not described
here, were first tested on the AP and/or WSJ databases.
Some of them were eliminated immediately. The func-
tion defined as BM15 gave almost uniformly better re-
sults than w^(1), after suitable values for the constants
had been found. BM11 appeared slightly less good than
BM15 on the small databases, but later runs on the
large databases showed that, with suitable choice of
constants, it was substantially, though not uniformly,
better. This may be a consequence of the greater varia-
tion in document lengths found in the large databases.
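The definitions of BM11 and BM15 appear earlier in the paper and are not reproduced in this excerpt. As a hedged sketch of the distinction usually drawn between the two (the exact constants and any additive document-length correction in the paper's own formulation may differ), with tf the within-document term frequency, dl the document length and avdl the average document length:

    \text{BM15:}\quad w = w^{(1)} \cdot \frac{tf}{k_1 + tf}
    \qquad
    \text{BM11:}\quad w = w^{(1)} \cdot \frac{tf}{k_1 \cdot dl/avdl + tf}

Because BM11 scales k_1 by relative document length, it discounts high term frequencies in long documents, which would be consistent with its advantage emerging only on the large databases, where document lengths vary more.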
Table 3 compares the more elaborate term weighting
functions with the standard w^(1) weighting and with a
baseline coordination level run.
Some work was done on the addition of adjacent pairs
of topic terms to the queries (see Section 2.5). A num-
ber of runs were done, using several different ways of ad-
justing the "natural" weights of adjacent pairs. There
was little difference between them, and the results are
at best only slightly better than those from single terms
alone (Table 3). There was also little difference between
using all adjacent pairs and using only those pairs which
derive from the same sentence of the topic, with no in-
tervening punctuation.
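As an illustration of this adjacent-pair extraction, the following is a minimal sketch (not the Okapi implementation; the tokenisation, stoplist handling and absence of stemming are assumptions):

import re

def adjacent_pairs(topic_text, stopwords=frozenset()):
    """Yield adjacent pairs of topic terms, keeping only pairs drawn from the
    same sentence with no intervening punctuation."""
    # Any punctuation breaks adjacency, so split the text into punctuation-free runs.
    for run in re.split(r"[^\w\s]+", topic_text.lower()):
        terms = [t for t in run.split() if t not in stopwords]
        for first, second in zip(terms, terms[1:]):
            yield first, second

# Example: pairs from a fragment of topic 70's title field (illustrative only)
print(list(adjacent_pairs("Surrogate Motherhood")))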
4.2 Routing
Potential query terms were obtained by "indexing" all
the known relevant documents from disks 1 and 2; the
topics themselves were not used (nor were known non-
relevant documents). These terms were then given w^(1)
weights and selection values [11] given by r × w^(1), where
r and R are as in equation 1.
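A minimal sketch of this term-selection step follows, assuming the standard Robertson/Sparck Jones form of w^(1) with the usual 0.5 corrections (equation 1 is not reproduced in this excerpt, so this form is an assumption); the function and variable names are hypothetical:

from math import log

def w1(r, n, R, N):
    """Assumed Robertson/Sparck Jones relevance weight:
    r = relevant docs containing the term, R = total relevant docs,
    n = docs containing the term, N = total docs in the collection."""
    return log(((r + 0.5) * (N - n - R + r + 0.5)) /
               ((n - r + 0.5) * (R - r + 0.5)))

def select_terms(term_stats, R, N, top=20):
    """Rank candidate terms extracted from the known relevant documents by the
    selection value r * w1 and keep the top ones (20 for the cityr1 run)."""
    scored = [(r * w1(r, n, R, N), term) for term, (r, n) in term_stats.items()]
    scored.sort(reverse=True)
    return [(term, sv) for sv, term in scored[:top]]

# term_stats maps term -> (r, n); toy numbers, for illustration only
print(select_terms({"surrog": (40, 120), "court": (45, 30000)}, R=50, N=750000))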
A large number of retrospective test runs were per-
formed on the complete disks 1 and 2 database, in which
the number of terms selected and the weighting function
were the independent variables. Overall, there was little
difference in the average precision over the range 10-25
terms. This is consistent with the results reported by
Harman in [10]. With regard to weighting functions,
BM1 was slightly better than BM15. However, look-
ing at individual queries, the optimal number of terms
varied between three (several topics) and 31 (topic 89)
with a median of 11; and BM15 was better than BM1
for 27 of the topics.
Two sets of official queries and results were produced.
For the cityr1 run, the top 20 terms were selected for
each topic and the weighting function was BM1. For
cityr2 the test runs were sorted for each topic by preci-
sion at 30 documents within recall within average pre-
cision, and the "best" combination of number of terms
and weighting function was chosen. When evaluated
retrospectively against the full disks 1 and 2 database
the cityr2 queries were about 17% better on average
precision and 10% better on recall than the cityr1 queries. The
official results (first and second rows of Table 4) show
a similar difference. Later, both sets of queries were
repeated using BM11 instead of the previous weighting
functions (third and fourth rows of the table). These
final runs both show substantially better results than
either of the official runs.