SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project chapter J. Yang R. Korfhage E. Rasmussen National Institute of Standards and Technology Donna K. Harman of the individuals, and the results form the feedback from the environment to the system where the genetic process is applied. The efficiency of the algorithm is due to parallel searching, with a set of individuals exploring the problem space simultaneously. Information of better structures is stored, exchanged and transferred from generation to generation, and combined to make individuals having all or most of the better features of their predecessors. A few researchers have applied genetic algorithms to information retrieval. Gordon (1988) used a genetic algorithm to modif[OCRerr] document descriptions to facilitate the retrieval of relevant documents for a group of users who are interested in similar topics. Frieder and Siegelmann (1991) developed a genetic algorithm to obtain an optimal allocation of documents onto nodes in distributed multiprocessors. In vector space models document retrieval can be seen as searching a very high- dimensional document space to find the area(s) where most of the relevant documents to a user query are located. We have been developing a genetic algorithm in document retrieval through the modification of query term weights. The steps of the genetic query modification processes can be described as follows. (1) Generate query variants: As described above, the genetic algorithms are applied to situations where there is a group of individuals. To apply the techniques, there needs to be a set of query individuals. They are generated in this step. Each query individual vector has the same elements as in the original query (user's query). Each element within a query vector represents a weight corresponding to the related term in the original query. The weights are assigned randomly using a random number generator, with the values of the weights scaled to the range [0,1). The number of query variants is set in the beginning of the experiments and remains fixed in later processes of each run. (2) Individual query performance evaluation: This process evaluates the performance of each individual query variant. First, the distances between query variants and documents are calculated, using the formula mentioned in Section 2. Second, for each query individual, those documents whose distance values to this query individual are less than a given threshold are viewed as relevant to the user request and are retrieved. Finally, the retrieval performance (P) of each query individual is assessed using the formula: P=A -B-C where A is the number of retrieved documents which are judged as relevant to the user request; B is the number of retrieved documents which are not relevant; and C is the number of relevant documents that are not retrieved by the query individual. The performance values for the query variants are to be used in the genetic modification. (Note: the P value also can be in the form: P = A - B, in case the C value is unknown which is normal in the actual online retrieval process. Note also that P is the internal value used by the system and is unknown to the user.) 34