SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
chapter
J. Yang
R. Korfhage
E. Rasmussen
National Institute of Standards and Technology
Donna K. Harman
(3) The genetic process:
After the evaluation process genetic operators are used to modify the structures of the
query variants, with the aim of improving the retrieval effectiveness of the variants.
(a) Selection: In this step those query individuals whose performances are greater than
the average performance of all the query variants are selected and retained for further
processing. Other query individuals are discarded.
(b) Reproduction: In this process the survivors from the previous step are reproduced.
In our algorithm, the reproduction algorithm developed by Baker (1987) is used. The number
of offspring for each individual depends on the ratio of its performance to the total
performance of all the survivors and the population size. Therefore, the individual survivor
with the highest performance will have the most offspring. Also, in this step, the query
variants are rearranged randomly in a sequence for later processing. We call the set of query
variants after reproduction the semi-new generation.
(c) Mutation: In the mutation step, some individuals are selected at random from the
semi-new generation, and some of their genes (i.e., the term weights) are selected and
assigned new values, also in the range [0,1]. The number of individuals to be selected
depends on the mutation ratio chosen in the experimentation. The selection of genes and the
new value assignments are also done randomly.
(d) Crossover: The crossover process mates pairs of query individuals in the semi-
new generation and exchanges parts of their term weights. Here, the two-point crossover
method is used. For each pair, two positions in the term vectors are selected randomly, and
all term weights between the two positions are exchanged for the two individuals. Thus, two
new individuals are born. However, it may be the case that not all individuals are chosen to
mate with others. The crossover ratio controls the number of mates.
After the crossover operation, a totally new generation is created and the process goes
back to step 2 and continues until a preset condition is satisfied, either a fixed number of
generations has been processed or no more relevant documents are retrieved.
The genetic retrieval algorithm is distinguished from other research using relevance
feedback: (1) Several query individual variants exist and explore the document space
simultaneously. (2) To modify the term weights, it is not required to calculate and analyze the
document term weights, at least in the current study. Query term weight modification is done
by the genetic mechanisms through the interchange of term weights between query individuals
and the introduction of new weights by the mutation operation. (3) Only term weights are
modified and no new terms are introduced and no terms are deleted. The importance of a
term is shown by the value of its weight.
35