NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Combination of Multiple Searches
E. Fox and J. Shaw

Edited by D. K. Harman, National Institute of Standards and Technology

Table 2: Collection statistics summary. Text, Dictionary and Document Vector sizes in Megabytes.

  Collection   Total Text   Dict.   Doc. Vectors      Docs
  AP-1             266       16.0       120.2         84678
  DOE-1            190       15.9        97.9        226087
  FR-1             258       15.8        53.8         26207
  WSJ-1            295       16.2       124.8         98735
  ZIFF-1           251       15.7        88.4         75180
  D1              1260        N/A       485.1        510887
  AP-2             248       15.9       110.4         79923
  FR-2             211       15.6        42.7         20108
  WSJ-2            255       16.0       105.5         74520
  ZIFF-2           188       15.4        63.6         56920
  D2               902        N/A       322.2        231471
  D1 & D2         2162        N/A       807.3        742358
  AP-3             250       15.9       111.2         78325
  PATN-3           254       15.6        31.3          6711
  SJM-3            319       16.1       114.4         90257
  ZIFF-3           362       16.0       109.8        161021
  D3              1185        N/A       366.7        336314
  Total           3347        N/A      1174.0       1078672

3 Retrieval

3.1 Queries

All of the queries were created from the topic descriptions provided by NIST. Two types of queries were used: P-norm extended Boolean queries and natural language vector queries. A single set of P-norm queries was created, but it was interpreted multiple times with different operator weights (P-values). Two different sets of vector queries were created from the topics, one containing information from fewer sections of a topic description. The Title, Description and Concepts sections of the topic descriptions were used in the creation of all three sets of queries; the Definitions section was also used in both sets of vector queries, while the P-norm query set and one of the vector query sets also contained information from the Narrative section of the topic descriptions. The vector query set that included the Narrative section of the topic is referred to as the long vector query set, for obvious reasons, while the other is referred to as the short vector query set.

The P-norm queries were written as complex Boolean expressions using AND and OR operators. Phrases were simulated using AND operators since the queries were intended only for soft-Boolean evaluation. The query terms were not specifically weighted; uniform operator weights (P-values) of 1.0, 1.5 and 2.0 were used on different evaluations of the query set.

Table 4: Summary of the five individual runs.

  Title    Query Type     Similarity Measure
  SV       Short vector   Cosine similarity
  LV       Long vector    Cosine similarity
  Pn1.0    P-norm         P-norm, P = 1.0
  Pn1.5    P-norm         P-norm, P = 1.5
  Pn2.0    P-norm         P-norm, P = 2.0

3.2 Individual Retrieval Runs

The first step in our TREC-2 experiments involved determining which weighting schemes would be most effective for P-norm queries. Our TREC-1 experiments with P-norm queries had obtained mixed results, performing poorly with binary document term weights in our Phase I experiments and, in our Phase II experiments using a tf-idf weighting scheme, performing well for a P-value of 1.0 but very poorly with larger P-values [4]. We performed several P-norm retrieval runs on the two AP and two WSJ training collections with topics 51 to 100 to determine the most effective term weighting scheme for P-norm queries with large test collections. The results from these experiments are shown in Table 3, using the standard TREC-2 average non-interpolated precision and the exact R-precision measures.
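Before turning to the weighting results, it may help to recall how P-norm operators score a document. The sketch below follows the standard P-norm formulation with uniform query-term weights, matching the unweighted queries described above; the function names and the example term weights are illustrative rather than taken from the paper.

    def pnorm_or(doc_weights, p):
        # Soft-Boolean OR in the standard P-norm model with uniform query-term
        # weights: ((w1^p + ... + wn^p) / n) ** (1/p), weights in [0, 1].
        n = len(doc_weights)
        return (sum(w ** p for w in doc_weights) / n) ** (1.0 / p)

    def pnorm_and(doc_weights, p):
        # Soft-Boolean AND: 1 - (((1-w1)^p + ... + (1-wn)^p) / n) ** (1/p).
        n = len(doc_weights)
        return 1.0 - (sum((1.0 - w) ** p for w in doc_weights) / n) ** (1.0 / p)

    # Hypothetical document term weights for a two-term phrase simulated with
    # AND, evaluated at the three P-values used in the runs (1.0, 1.5, 2.0).
    weights = [0.8, 0.4]
    for p in (1.0, 1.5, 2.0):
        print(f"P = {p}: AND = {pnorm_and(weights, p):.3f}, "
              f"OR = {pnorm_or(weights, p):.3f}")

At P = 1.0 both operators reduce to a simple average, giving vector-like behavior; as P grows, AND and OR move toward the strict Boolean minimum and maximum of the operand weights.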
The most effective weighting scheme turned out to be the SMART ann weighting scheme, which confirmed the result obtained originally by Fox for the much smaller classical document collections [3].

The two sets of vector queries were evaluated using the standard cosine correlation similarity method as implemented by SMART. The same SMART ann weighting scheme used for the P-norm queries was used on the vector queries for several reasons. First, a weighting scheme that did not use any collection statistics was needed for the routing experiments. Second, the methods used in combining runs, described in the next section, required a similar range of possible similarity values produced by each run. Finally, the necessity of merging results from each collection into a single set of results was simplified, since the resulting similarity values were not based on collection statistics, which would have differed for each collection. The P-norm queries were evaluated using three different P-values, again using the SMART ann weighting scheme, based on specific P-norm experiments described below. The five individual runs are summarized in Table 4.

The five individual runs were performed and evaluated for each of the nine training collections on topics 51 to 100. The results for these experiments are given in
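To make the weighting concrete, here is a minimal sketch of the ann scheme and cosine scoring. It assumes the conventional reading of "ann" in SMART's triple notation (augmented term frequency 0.5 + 0.5 * tf / max_tf, no collection-frequency factor, no length normalization); the constants and the toy term counts are the standard textbook ones, not values quoted from this paper.

    import math

    def ann_weights(term_freqs):
        # SMART "ann" weighting, conventionally: augmented term frequency
        # 0.5 + 0.5 * tf / max_tf, no idf factor, no length normalization.
        # No collection statistics are required, which is what the routing
        # and result-merging arguments above rely on.
        max_tf = max(term_freqs.values())
        return {t: 0.5 + 0.5 * tf / max_tf for t, tf in term_freqs.items()}

    def cosine(q, d):
        # Cosine correlation between two sparse weight vectors (dicts).
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        qn = math.sqrt(sum(w * w for w in q.values()))
        dn = math.sqrt(sum(w * w for w in d.values()))
        return dot / (qn * dn) if qn and dn else 0.0

    # Hypothetical raw term counts for a short query and one document.
    query_vec = ann_weights({"combination": 1, "searches": 1})
    doc_vec = ann_weights({"combination": 3, "searches": 1, "retrieval": 5})
    print(cosine(query_vec, doc_vec))

Because the weights depend only on within-document term frequencies, a document receives the same weight vector no matter which collection it belongs to, which is why scores from runs on the separate collections can be merged directly.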