SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project chapter J. Yang R. Korfhage E. Rasmussen National Institute of Standards and Technology Donna K. Harman Relevance feedback, although started more than twenty years ago, attracted more research in recent years. With advances in computer hardware and software, the feedback process, which requires more computing time and friendlier user interface than the traditional method, will dominate the information retrieval area. However, feedback techniques also need to be improved. They should include more intelligent mechanisms which embed the ability to adapt to the changing environment and handle the feedback process automatically. We have been developing an adaptive method using genetic algorithms to modify user queries automatically. Our preliminary experiments based on several simple data collections showed promising results (Yang and Korfhage, 1992a, 1992b). This algorithm was applied to the TREC project. This paper reports the experiments and some of the results. 2e The Model Our system is based on a vector space model. Within the model, both documents and queries are represented as vectors. A document vector (D[OCRerr] ) with n keywords can be represented as: = (t[OCRerr]1, t[OCRerr]2[OCRerr] t[OCRerr]3 , where (, = 1......, n) is the assigned value of the 1h keyword. Either weighted or unweighted document keywords can be used. For unweighted document keywords, tij is either 0 or 1. For weighted document keywords, the value of tij is a real number in the range [0,1]. In this model a query (Q) with m keywords is also represented in a vector format as: Q = (qtj, qt2..., qt[OCRerr]) where qt[OCRerr](k= 1,..., m) is the kth keyword in the query, where the keywords are extracted from the user natural language requests. However, no weights are assigned to query terms in the initial stage; query term weights are assigned by the system during the processing to be described later. In general, under vector space models, document retrieval is based on a similarity measurement between the query and documents. This means that documents with high similarity to the query are judged more relevant to the query and should be retrieved first. In our model, the similarity measure is based on a metric, the L[OCRerr] metric, which measures the distance between document vectors and query vectors. This means that the shorter the distance between a document and the query, the higher the similarity between them. The distance value between a document D[OCRerr] and the query Q is calculated, using the Lp metric, by the formula: 32