SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in Information Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
chapter
J. Yang
R. Korfhage
E. Rasmussen
National Institute of Standards and Technology
Donna K. Harman
Table 2. Average retrieval time for topics 1-50 on the first dataset
(unit: minutes)

Generation 0    Generation 1    Generation 2
    3:56            2:19            2:05
Table 3. Total amount of storage for inverted and indexed files -- disk one only
(unit: megabytes)

                  DOE      AP     ZIFF     WSJ
Inverted files   162.3   199.8   143.7   223.4
Indexed files      3.0     2.2     2.4     2.1
Address files      4.3     1.7     1.7     2.6
7. Results
This section describes several results of applying the genetic algorithm to the TREC
document collection. The examples are taken from the training queries (topics 1 to 50) on the
DOE database.
(1) Query convergence
In large document collections like the TREC databases, the genetic algorithm caused
the query variants to converge within 3 to 6 generations in most cases. For example,
Table 4 shows the term weights of the query individuals in the first generation (generation 0)
and the last generation (generation 5) for topic 3. For most of the query terms, the weights
across the query individuals converged to a single value in the final generation. The few
variations that remained were caused by the mutation operation. Table 5 shows a similar
situation for topic 12, where the final generation is generation 4.
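
One way to make the notion of convergence above concrete is to measure, for each query term, how far its weight varies across the query individuals of a generation. The following is a minimal sketch under stated assumptions, not the procedure used in the experiments: the function names, the tolerance parameter, and the term names and weights in the sample populations are all hypothetical and are not taken from Tables 4 or 5.

```python
from typing import Dict, List

# One query individual: a mapping from term to weight (hypothetical representation).
Query = Dict[str, float]


def weight_spread(population: List[Query]) -> Dict[str, float]:
    """Per-term spread (max minus min weight) across all query individuals."""
    terms = sorted({term for query in population for term in query})
    spread = {}
    for term in terms:
        # A term missing from an individual is treated as weight 0.0 (an assumption).
        weights = [query.get(term, 0.0) for query in population]
        spread[term] = max(weights) - min(weights)
    return spread


def converged_fraction(population: List[Query], tol: float = 1e-6) -> float:
    """Fraction of terms whose weights agree across individuals within tol."""
    spread = weight_spread(population)
    if not spread:
        return 1.0
    agreeing = sum(1 for value in spread.values() if value <= tol)
    return agreeing / len(spread)


# Illustrative populations only; the weights are made up for this sketch.
generation_0 = [
    {"t1": 0.2, "t2": 0.9, "t3": 0.5},
    {"t1": 0.7, "t2": 0.1, "t3": 0.4},
    {"t1": 0.5, "t2": 0.6, "t3": 0.8},
]
generation_last = [
    {"t1": 0.7, "t2": 0.6, "t3": 0.4},
    {"t1": 0.7, "t2": 0.6, "t3": 0.4},
    {"t1": 0.7, "t2": 0.6, "t3": 0.9},  # one weight left different, as a mutation might
]

print(converged_fraction(generation_0))
print(converged_fraction(generation_last))
```

In this sketch a generation in which most terms have zero spread, with a few outliers of the kind mutation would produce, yields a fraction close to but below 1. Reporting a fraction rather than a strict yes/no test keeps such isolated mutated weights from hiding the overall convergence.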
An interesting phenomenon is how the query term weights changed. The genetic
operators select the query individuals whose performance values are higher than the average
performance of all the individuals, and exchange parts of their term weights using mutation
and crossover. Although the two operations are random, the results are interesting. The