NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

Okapi at TREC-2
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford

The present Okapi allows adjacency searches, so a phrase that is not specifically indexed can be searched, and assigned a weight in the usual Okapi fashion as if it had been indexed. One problem with that approach is that the single words that make up the phrase will probably also be included in the query, and that suggests that a document which contains the phrase will be overweighted, as it will be given the weight assigned to the phrase in addition to the individual term weights. So in the present experiments the weight assigned to the phrase has been adjusted downwards, by deducting the weights of the constituent terms, to allow for the fact that the individual term weights have necessarily been added. Where this correction would give a negative weight to the phrase, it has been adjusted again to an arbitrary small positive number.

2.6 Weighting functions used

More than 20 combinations of the weighting functions discussed above were implemented at one time or another. Those mentioned in this paper are listed here. For brevity, most of the functions are referred to as BMnn (Best Match).

BM0: Flat, or quorum, weighting. Each term is given the same weight.

BM1: w(1) termweights.

BM15: 2-Poisson termweights as equation 3 with document length correction as equation 5.

BM11: 2-Poisson termweights with document length normalisation as equation 4 (see footnote 2).

Footnote 2: In theory there was also an equation 5 document length correction, but the best value of k2 was found to be zero.

3 Document processing

For TREC-1 City used an elaborate 25-field structure which was intended to make all the disparate datasets on the CDs fit a unified model. It would, for example, have been possible to restrict searches to "title", "headline" etc. In the event only the TEXT was used.

For TREC-2, fields which looked useful for searching were simply concatenated into one long field. For most datasets fields other than DOCNO and TEXT were ignored, but the SJM LEAD PARAGRAPH, the Ziff SUMMARY and a few additional fields from the Patents records were included. This was done using a simple perl script (in contrast to the TREC-1 conversion program which used lex, yacc and C). Most of the known data errors were handled satisfactorily, although for some reason there still remained a few duplicate DOCNOs from disk 1 and/or 2.

4 Automatic query processing

4.1 Ad-hoc

A large number of evaluation runs have been done to investigate

* the effect of query term source
* the use of a query term frequency (qtf) component in term weighting, and
* the use of algorithmically derived term pairs.

4.1.1 Derivation of queries from the topics

Topic processing was very simple. A program (written in awk) was used to isolate the required topic fields, which were then parsed and the resulting terms stemmed in accordance with the indexing procedures of the database to be searched. A small additional stop list was applied to the NARRATIVE and DESCRIPTION fields only. If required, the procedure also output pairs of adjacent terms which occur in the same subfield of the topic and with no intervening punctuation. For example the command

    get_qterms 70 trec12_93 tcd pairs=1

applied to

    <title> Topic: Surrogate Motherhood
    <desc> Description: Document will report judicial proceedings and opinions on contracts for surrogate motherhood.
    <con> Concept(s):
    1. surrogate, mothers, motherhood
    2. judge, lawyer, court, lawsuit, custody, hearing, opinion, finding

(topic 70) gave

    70:19:desc:1:contract:1
    70:19:con:1:court:1
    70:19:con:1:custodi:1
    70:19:con:1:find:1
    70:19:con:1:hear:1
    70:19:con:1:judg:1
    70:19:desc:1:judici:1
    70:19:con:1:lawsuit:1
    70:19:con:1:lawyer:1
    70:19:con:1:mother:1
    70:19:tit:1:motherhood:3
    70:19:con:1:opinion:2
    70:19:desc:1:proceed:1
    70:19:tit:1:surrog:3
    70:19:desc:2:contract:surrog:1
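
The derivation of single terms and adjacent term pairs described in 4.1.1 can be illustrated with a minimal Python sketch. It is not the awk program used in the experiments. The stop list, the `toy_stem` placeholder stemmer, the `query_terms` function and the layout of the `topic_70` dictionary are all assumptions made for illustration. The sketch also assumes that stop words are dropped before adjacency is tested, which the contract/surrog pair in the sample output suggests but the text does not state.

```python
import re
from collections import Counter

# Hypothetical stop list; the real stop lists are not reproduced in the paper.
STOP_WORDS = {"a", "an", "and", "for", "of", "on", "the", "will"}


def toy_stem(word):
    """Placeholder stemmer: lowercase and strip one trailing 's'.

    Stands in for the real indexing stemmer, which is not shown here.
    """
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def query_terms(fields, pairs=False):
    """Count stemmed terms and, optionally, adjacent term pairs.

    `fields` maps a field name to a list of subfield strings.
    Returns two Counters: (term_counts, pair_counts).
    """
    term_counts, pair_counts = Counter(), Counter()
    for subfields in fields.values():
        for subfield in subfields:
            # Punctuation cuts a subfield into segments, so a pair
            # never spans intervening punctuation.
            for segment in re.split(r"[^\w\s-]+", subfield):
                stems = [toy_stem(w) for w in segment.split()
                         if w.lower() not in STOP_WORDS]
                term_counts.update(stems)
                if pairs:
                    # Assumption: adjacency is tested after stop word removal.
                    pair_counts.update(zip(stems, stems[1:]))
    return term_counts, pair_counts


# Topic 70, keyed by the field tags that appear in the sample output.
topic_70 = {
    "tit": ["Surrogate Motherhood"],
    "desc": ["Document will report judicial proceedings and opinions "
             "on contracts for surrogate motherhood."],
    "con": ["surrogate, mothers, motherhood",
            "judge, lawyer, court, lawsuit, custody, hearing, opinion, finding"],
}

term_counts, pair_counts = query_terms(topic_70, pairs=True)
```

With these assumptions the within-topic counts for "surrogate", "motherhood" and "opinion" come out as 3, 3 and 2, the same frequencies shown for surrog, motherhood and opinion in the sample output, although the stems themselves differ because the placeholder stemmer is much cruder. The comma-separated concept terms produce no pairs, since the commas count as intervening punctuation.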