SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project chapter J. Yang R. Korfhage E. Rasmussen National Institute of Standards and Technology Donna K. Harman DIS(D[OCRerr], Q) =11 D[OCRerr], Q II = (Xltik - qt[OCRerr][OCRerr])lIP k where k, (k = 1......, m), ranges over the set of query terms. Theoretically, the value of p can be in the range [1, J. In the current experiments, p is set to 2. That is, the distance measure is the Euclidean distance. Note that only terms which are present in the query are used in the formula. This means if a term is present in the document but not in the query, the term is assumed of no concern or not known by the user. In other words, a document will be considered in the retrieval process only if it has at least one term in common with those in the query. However, variations exist in the actual implementation. The keyterms in documents are unweighted in the [OCRerr]TREC databases we have used. That is, the elements in the document vectors are in binary mode, 0 or 1. This was determined for two reasons. First, for a large database of this kind, obtaining document term weights using general term weighting methods, such as the inverse document frequency (e.g., Salton and Buckley, 1988) requires counting term frequencies from all of the documents in the database, which is resource intensive for a database of this size. Second, previous experiments using the Oranfield database indicate that our system performance equally well with either weighted or unweighted document terms. Each query vector is generated from a topic provided in the TREC project. In general, a topic includes several descriptive items: the sequential number of the topic, domain, title, descriptive, narrative, concepts, factors, nationality, etc. The terms of a query vector are derived from the title and the concepts of the topic. In some cases the nationality is added, if it is important to meet the requirement of the narrative. The query vector term weights are assigned by the system at the beginning of each run. 3. The Algorithm The method we developed is based on the relevance feedback concept. However, the algorithm underlying the process is totally different from those used in other relevance feedback studies. Our system is also based on genetic algorithms which are known for their efficiency and effectiveness in searching large problem spaces to find optimal solutions (e.g., Holland, 1975; Goldberg, 1989). The basic concept behind genetic algorithms is that there exists a population of individual variants (those individuals can be various types, depending on the actual problem to be solved) in an environment, each trying to survive by exchanging information and competing with others. The genetic algorithms use methods analogous to natural evolution mechanisms to generate those individuals who provide the optimal or near optimal solution to the problem. The adaptability of genetic algorithms is due to the fact that individual selection and manipulation are adaptive to the existing environment. The environment evaluates the fitness 33