SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
chapter
J. Yang
R. Korfhage
E. Rasmussen
National Institute of Standards and Technology
Donna K. Harman
of the individuals, and the results form the feedback from the environment to the system
where the genetic process is applied. The efficiency of the algorithm is due to parallel
searching, with a set of individuals exploring the problem space simultaneously. Information
of better structures is stored, exchanged and transferred from generation to generation, and
combined to make individuals having all or most of the better features of their predecessors.
A few researchers have applied genetic algorithms to information retrieval. Gordon
(1988) used a genetic algorithm to modif[OCRerr] document descriptions to facilitate the retrieval of
relevant documents for a group of users who are interested in similar topics. Frieder and
Siegelmann (1991) developed a genetic algorithm to obtain an optimal allocation of
documents onto nodes in distributed multiprocessors.
In vector space models document retrieval can be seen as searching a very high-
dimensional document space to find the area(s) where most of the relevant documents to a
user query are located. We have been developing a genetic algorithm in document retrieval
through the modification of query term weights. The steps of the genetic query modification
processes can be described as follows.
(1) Generate query variants:
As described above, the genetic algorithms are applied to situations where there is a
group of individuals. To apply the techniques, there needs to be a set of query individuals.
They are generated in this step. Each query individual vector has the same elements as in the
original query (user's query). Each element within a query vector represents a weight
corresponding to the related term in the original query. The weights are assigned randomly
using a random number generator, with the values of the weights scaled to the range [0,1).
The number of query variants is set in the beginning of the experiments and remains fixed in
later processes of each run.
(2) Individual query performance evaluation:
This process evaluates the performance of each individual query variant. First, the
distances between query variants and documents are calculated, using the formula mentioned
in Section 2. Second, for each query individual, those documents whose distance values to
this query individual are less than a given threshold are viewed as relevant to the user request
and are retrieved. Finally, the retrieval performance (P) of each query individual is assessed
using the formula:
P=A -B-C
where A is the number of retrieved documents which are judged as relevant to the user
request; B is the number of retrieved documents which are not relevant; and C is the number
of relevant documents that are not retrieved by the query individual. The performance values
for the query variants are to be used in the genetic modification. (Note: the P value also can
be in the form: P = A - B, in case the C value is unknown which is normal in the actual online
retrieval process. Note also that P is the internal value used by the system and is unknown to
the user.)
34