NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman (Ed.), National Institute of Standards and Technology

Combination of Multiple Searches
E. Fox and J. Shaw
Table 2: Collection statistics summary. Text, Dictionary and Document Vector sizes in megabytes.

Collection   Text   Dict.   Doc. Vectors   Total Docs
AP-1          266   16.0          120.2        84678
DOE-1         190   15.9           97.9       226087
FR-1          258   15.8           53.8        26207
WSJ-1         295   16.2          124.8        98735
ZIFF-1        251   15.7           88.4        75180
D1           1260    N/A          485.1       510887
AP-2          248   15.9          110.4        79923
FR-2          211   15.6           42.7        20108
WSJ-2         255   16.0          105.5        74520
ZIFF-2        188   15.4           63.6        56920
D2            902    N/A          322.2       231471
D1 & D2      2162    N/A          807.3       742358
AP-3          250   15.9          111.2        78325
PATN-3        254   15.6           31.3         6711
SJM-3         319   16.1          114.4        90257
ZIFF-3        362   16.0          109.8       161021
D3           1185    N/A          366.7       336314
Total        3347    N/A         1174.0      1078672
3 Retrieval
3.1 Queries
All of the queries were created from the topic descriptions provided by NIST. Two types of queries were used, P-norm extended boolean queries and natural language vector queries. A single set of P-norm queries was created, but it was interpreted multiple times with different operator weights (P-values). Two different sets of vector queries were created from the topics, one containing information from fewer sections of a topic description. The Title, Description and Concepts sections of the topic descriptions were used in the creation of all three sets of queries; the Definitions section was also used in both sets of vector queries; and the P-norm query set and one of the vector query sets additionally contained information from the Narrative section of the topic descriptions. The vector query set that included the Narrative section of the topic is referred to as the long vector query set, for obvious reasons, while the other is referred to as the short vector query set.
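For illustration, the following sketch shows one way the two vector query sets could be assembled from the sections of a topic description; the topic field names and the helper function are hypothetical and do not correspond to the actual SMART preprocessing used.

    def build_vector_queries(topic):
        """Assemble the short and long vector queries from a parsed topic.

        `topic` is assumed to be a dict with one entry per topic section,
        e.g. topic["title"], topic["description"] (hypothetical keys)."""
        common = [topic["title"], topic["description"],
                  topic["concepts"], topic["definitions"]]
        short_query = " ".join(common)
        # Only the long vector query also draws on the Narrative section.
        long_query = " ".join(common + [topic["narrative"]])
        return short_query, long_query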
The P-norm queries were written as complex boolean expressions using AND and OR operators. Phrases were simulated using AND operators, since the queries were intended only for soft-boolean evaluation. The query terms were not specifically weighted; uniform operator weights (P-values) of 1.0, 1.5 and 2.0 were used on different evaluations of the query set.
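For reference, the standard P-norm operators can be sketched as follows for unweighted query terms, which matches the unweighted queries used here; the exact SMART implementation may differ in detail. With operand similarities d_i in [0, 1], both operators reduce to a simple average at P = 1 and approach strict boolean max/min behaviour as P grows.

    def pnorm_or(d, p):
        """P-norm OR over unweighted operand similarities d (each in [0, 1])."""
        return (sum(di ** p for di in d) / len(d)) ** (1.0 / p)

    def pnorm_and(d, p):
        """P-norm AND over unweighted operand similarities d (each in [0, 1])."""
        return 1.0 - (sum((1.0 - di) ** p for di in d) / len(d)) ** (1.0 / p)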
Table 4: Summary of the five individual runs.

Title   Query Type     Similarity Measure
SV      Short vector   Cosine similarity
LV      Long vector    Cosine similarity
Pn1.0   P-norm         P-norm, P = 1.0
Pn1.5   P-norm         P-norm, P = 1.5
Pn2.0   P-norm         P-norm, P = 2.0
3.2 Individual Retrieval Runs
The first step in our TREC-2 experiments was to determine which weighting scheme would be most effective for P-norm queries. Our TREC-1 experiments with P-norm queries had obtained mixed results: they performed poorly with binary document term weights in our Phase I experiments, and in our Phase II experiments using a tf-idf weighting scheme they performed well for a P-value of 1.0 but very poorly with larger P-values [4]. We performed several P-norm retrieval runs on the two AP and two WSJ training collections with topics 51 to 100 to determine the most effective term weighting scheme for P-norm queries with large test collections. The results from these experiments are shown in Table 3, using the standard TREC-2 average non-interpolated precision and exact R-precision measures. The most effective weighting scheme turned out to be the SMART ann weighting scheme, which confirmed the result obtained originally by Fox for the much smaller classical document collections [3].
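In the SMART triple notation, "ann" denotes an augmented term-frequency component with no idf factor and no vector-length normalization. A minimal sketch of that weighting, assuming the usual 0.5 augmentation constant, is:

    def ann_weight(tf, max_tf):
        """SMART 'a' (augmented tf) component: 0.5 + 0.5 * tf / max_tf.
        The trailing 'nn' means no idf factor and no length normalization,
        so this value is used directly as the term weight."""
        return 0.5 + 0.5 * (tf / max_tf) if tf > 0 else 0.0

    def ann_vector(term_counts):
        """Apply ann weighting to a document or query, given raw term counts."""
        max_tf = max(term_counts.values())
        return {term: ann_weight(tf, max_tf) for term, tf in term_counts.items()}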
The two sets of vector queries were evaluated using the standard cosine correlation similarity method as implemented by SMART. The same SMART ann weighting scheme used for the P-norm queries was used on the vector queries for several reasons. First, a weighting scheme that did not use any collection statistics was needed for the routing experiments. Second, the methods used for combining runs, described in the next section, required a similar range of possible similarity values from each run. Finally, merging the results from each collection into a single set of results was simplified, since the resulting similarity values were not based on collection statistics that would have differed for each collection. The P-norm queries were evaluated using three different P-values, again with the SMART ann weighting scheme, based on the specific P-norm experiments described below. The five individual runs are summarized in Table 4.
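As a concrete outline of the evaluation and merging steps just described (not the SMART implementation itself), the sketch below computes the cosine correlation between sparse weighted vectors and merges per-collection rankings by raw similarity score:

    import math
    import heapq

    def cosine(q, d):
        """Cosine correlation between two sparse vectors (dicts term -> weight)."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        nq = math.sqrt(sum(w * w for w in q.values()))
        nd = math.sqrt(sum(w * w for w in d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    def merge_collections(per_collection_results, top_k=1000):
        """Merge ranked lists, one per collection, into a single result set.
        Each list holds (doc_id, similarity) pairs; because the ann weights
        use no collection statistics, the scores are comparable across lists."""
        merged = [pair for results in per_collection_results for pair in results]
        return heapq.nlargest(top_k, merged, key=lambda pair: pair[1])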
The five individual runs were performed and evaluated for each of the nine training collections on topics 51 to 100. The results for these experiments are given in