SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
chapter
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
The present Okapi allows adjacency searches, so a
phrase that is not specifically indexed can be searched,
and assigned a weight in the usual Okapi fashion as if
it had been indexed.
One problem with that approach is that the single
words that make up the phrase will probably also be
included in the query, and that suggests that a docu-
ment which contains the phrase will be overweighted,
as it will be given the weight assigned to the phrase
in addition to the individual term weights. So in the
present experiments the weight assigned to the phrase
has been adjusted downwards, by deducting the weights
of the constituent terms, to allow for the fact that the
individual term weights have necessarily been added.
Where this correction would give a negative weight to
the phrase, it has been adjusted again to an arbitrary
small positive number.
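The adjustment just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the value of the small positive floor are hypothetical, since the text says only that the replacement value is arbitrary.

```python
def adjusted_phrase_weight(phrase_weight, constituent_weights, floor=0.001):
    """Deduct the constituent term weights from a phrase weight.

    floor is a hypothetical value standing in for the "arbitrary small
    positive number" mentioned in the text.
    """
    # The single terms are also in the query, so their weights have
    # already been added for any document that contains the phrase.
    adjusted = phrase_weight - sum(constituent_weights)
    # A negative result is replaced by the small positive floor.
    return floor if adjusted < 0 else adjusted
```

With these assumptions a phrase weighted 10.0 whose two terms are weighted 3.0 and 4.0 contributes 3.0, while a phrase weighted 5.0 with the same terms falls back to the floor.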
2.6 Weighting functions used
More than 20 combinations of the weighting functions
discussed above were implemented at one time or an-
other. Those mentioned in this paper are listed here.
For brevity, most of the functions are referred to as
BMnn (Best Match).
BM0: Flat, or quorum, weighting. Each term is given
the same weight.
4 Automatic query processing
4.1 Ad-hoc
A large number of evaluation runs have been done to
investigate
* the effect of query term source
* the use of a query term frequency (qq) component
in term weighting, and
* the use of algorithmically derived term pairs.
4.1.1 Derivation of queries from the topics
Topic processing was very simple. A program (written
in awk) was used to isolate the required topic
fields, which were then parsed and the resulting terms
stemmed in accordance with the indexing procedures of
the database to be searched. A small additional stop list
was applied to the NARRATIVE and DESCRIPTION
fields only. If required, the procedure also output pairs
of adjacent terms which occur in the same subfield of
the topic and with no intervening punctuation. For ex-
ample the command
get_qterms 70 trec12_93 tcd pairs=1
applied to
BM1: w(1) termweights.
BM15: 2-Poisson termweights as equation 3 with doc-
ument length correction as equation 5.
BM11: 2-Poisson termweights with document length
normalisation as equation 4 (see footnote 2).
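As a rough illustration of how the functions listed above differ, the sketch below writes each one as a per-term weight. Equations 1, 3, 4 and 5 are not reproduced in this part of the text, so the forms used here are the ones commonly given for these Okapi functions elsewhere and should be read as an assumption. All function and parameter names (N, n, R, r, tf, k1, k2, nq, dl, avdl) are chosen for this example.

```python
import math

def w1(N, n, R=0, r=0):
    """Assumed form of the equation 1 relevance weight, with 0.5 corrections.

    N: documents in the collection, n: documents containing the term,
    R: known relevant documents, r: relevant documents containing the term.
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm0_weight():
    # Flat (quorum) weighting: every term gets the same weight.
    return 1.0

def bm1_weight(N, n, R=0, r=0):
    # w(1) alone, with no within-document term frequency component.
    return w1(N, n, R, r)

def bm15_weight(tf, k1, N, n, R=0, r=0):
    # Assumed equation 3: term frequency saturates as tf grows.
    return tf / (k1 + tf) * w1(N, n, R, r)

def bm11_weight(tf, k1, dl, avdl, N, n, R=0, r=0):
    # Assumed equation 4: k1 is scaled by document length relative
    # to the average document length.
    return tf / (k1 * dl / avdl + tf) * w1(N, n, R, r)

def length_correction(k2, nq, dl, avdl):
    # Assumed equation 5: a per-document correction added once per query
    # of nq terms; it vanishes when k2 is zero.
    return k2 * nq * (avdl - dl) / (avdl + dl)
```

Under these assumed forms BM11 and BM15 agree for a document of exactly average length, and BM11 gives a smaller weight than BM15 to the same term frequency in a longer document.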
3 Document processing
For TREC-1 City used an elaborate 25-field structure
which was intended to make all the disparate datasets
on the CDs fit a unified model. It would, for exam-
ple, have been possible to restrict searches to "title",
"headline" etc. In the event only the TEXT was used.
For TREC-2, fields which looked useful for searching
were simply concatenated into one long field. For most
datasets fields other than DOCNO and TEXT were
ignored, but the SJM LEAD PARAGRAPH, the Ziff
SUMMARY and a few additional fields from the Patents
records were included. This was done using a simple perl
script (in contrast to the TREC-1 conversion program
which used lex, yacc and C). Most of the known data
errors were handled satisfactorily, although for some
reason there still remained a few duplicate DOCNOs from
disk 1 and/or 2.
2 In theory there was also an equation 5 document length
correction, but the best value of k2 was found to be zero.
<title> Topic: Surrogate Motherhood
<desc> Description:
Document will report judicial proceedings and
opinions on contracts for surrogate mother-
hood.
<con> Concept(s):
1. surrogate, mothers, motherhood
2. judge, lawyer, court, lawsuit, custody, hearing,
opinion, finding
(topic 70)
gave
70:19:desc:1:contract:1
70:19:con:1:court:1
70:19:con:1:custodi:1
70:19:con:1:find:1
70:19:con:1:hear:1
70:19:con:1:judg:1
70:19:desc:1:judici:1
70:19:con:1:lawsuit:1
70:19:con:1:lawyer:1
70:19:con:1:mother:1
70:19:tit:1:motherhood:3
70:19:con:1:opinion:2
70:19:desc:1:proceed:1
70:19:tit:1:surrog:3
70:19:desc:2:contract:surrog:1
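
The derivation of adjacent term pairs described in section 4.1.1 might look roughly like the following Python sketch. It is not the awk program used in the experiments. The stop list is a hypothetical stand-in, the stemmer is left as a pluggable function that defaults to no stemming, and removing stop words before pairing is an assumption suggested by the contract:surrog pair in the sample output.

```python
import re

# Hypothetical stop list for illustration only; the actual list is not given.
STOP_WORDS = {"a", "an", "and", "for", "of", "on", "the", "will"}

def adjacent_pairs(subfield_text, stem=lambda term: term, stop_words=STOP_WORDS):
    """Return pairs of adjacent terms within one topic subfield.

    Terms separated by punctuation are never paired. stem defaults to the
    identity function; a real stemmer would be passed in here.
    """
    pairs = []
    # Split on punctuation first, so that no pair spans a punctuation mark.
    for segment in re.split(r"[^\w\s]+", subfield_text.lower()):
        terms = [stem(t) for t in re.findall(r"[a-z0-9]+", segment)
                 if t not in stop_words]
        # Pair each remaining term with its immediate successor.
        pairs.extend(zip(terms, terms[1:]))
    return pairs
```

For instance, `adjacent_pairs("contracts for surrogate motherhood.")` yields the pairs (contracts, surrogate) and (surrogate, motherhood), while a comma-separated concept list such as "judge, lawyer, court" yields none, because every term is separated from its neighbour by punctuation.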