NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in Information Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
J. Yang, R. Korfhage, E. Rasmussen
National Institute of Standards and Technology, Donna K. Harman

4. Preprocessing of Documents and Queries

As in most information retrieval systems, the documents and topics in the TREC project need to be preprocessed before they can be used by our algorithm.

(1) Document processing: Document retrieval in our system is based on keyword matching. In the current system, a keyword¹ is a single word. To facilitate retrieval, we need to create inverted files for the documents. The process for creating the inverted files is as follows.

(a) Keyword extraction: Keywords are extracted from each document. Stopwords (2,529 on the list) are deleted, and the remaining keywords are stemmed using the Porter stemming algorithm, which was implemented by Fox (Frakes, 1992). The number of the document from which a keyword is extracted is stored with that keyword; it is used to create the inverted files and later to facilitate retrieval of the full text of the document. An "address file" is also generated, which gives, for each document, the offset of its location relative to the original file in which the document is stored.

(b) Creation of inverted files: An inverted file is created for each database to facilitate the retrieval process. Each entry in the file consists of a keyword followed by the list of numbers of the documents from which the keyword was extracted. Three levels of index files are generated for the inverted file to reduce search time. Each index file includes keywords, their offsets in the inverted file, and (for the second and third levels) their offsets in the immediately higher-level file.

(2) Query processing: Queries are generated from the topics provided in the TREC project.
For each topic a single query vector (the original query) is generated, consisting of a list of keywords from the title and the concepts of the topic. For some queries, information about nationality is also included when it is needed to satisfy the requests in the narrative description of the related topic. The stemming algorithm is also applied to the terms in the query. For the training queries we did not add any terms from the other descriptive items in the topics. For some ad hoc queries, however, we added several keywords from the narrative because we thought those keywords would be useful in identifying relevant documents. The routing queries are the query individuals from the last generation of the training queries.

Ten query individual vectors were generated for each original query. For the training queries, the initial term weights of the query individuals were assigned randomly and then modified by the genetic algorithm. The final query individual vectors from the training set were used as routing queries, with no weights modified.

¹ Keyword, keyterm and term will be used interchangeably in this paper.
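
As an illustration of the document processing in (1), the following minimal sketch extracts keywords, builds an inverted file, and records an address file. It is not the code used in the experiments. The short stopword set and the `toy_stem` function are hypothetical stand-ins for the 2,529-word stoplist and the Porter stemmer, the function names and the in-memory dictionaries are assumptions made for the example, and the three-level index files are omitted.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the 2,529-word stoplist used in the paper.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def toy_stem(word):
    """Crude suffix stripper; a placeholder only, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def extract_keywords(text):
    """Lowercase and tokenize the text, drop stopwords, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_file(documents):
    """Map each keyword to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_no, text in documents.items():
        for keyword in extract_keywords(text):
            postings[keyword].add(doc_no)
    return {kw: sorted(docs) for kw, docs in sorted(postings.items())}


def build_address_file(documents):
    """Record each document's byte offset, assuming the documents are stored
    back to back in one file and separated by a single newline."""
    offsets, position = {}, 0
    for doc_no, text in documents.items():
        offsets[doc_no] = position
        position += len(text.encode("utf-8")) + 1  # +1 for the separator
    return offsets


docs = {
    "D1": "Genetic algorithms in retrieval",
    "D2": "Retrieval of documents",
}
inverted = build_inverted_file(docs)
addresses = build_address_file(docs)
```

Here `inverted` pairs each stemmed keyword with the documents it came from, and `addresses` gives the offset at which the full text of each document could be read back from the assumed source file.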
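
The initialization of the query individuals described in (2) can be sketched in the same spirit. The text states only that ten individuals are generated per original query and that their initial term weights are assigned randomly; the weight range [0, 1), the dictionary representation, the `seed` parameter, and the name `make_query_individuals` are assumptions made for this example.

```python
import random


def make_query_individuals(keywords, population_size=10, seed=None):
    """Return `population_size` weight vectors over the same keyword list.

    Each individual maps every keyword of the original query to a random
    initial weight. The range [0, 1) is an assumption, not taken from the paper.
    """
    rng = random.Random(seed)
    return [
        {kw: rng.random() for kw in keywords}
        for _ in range(population_size)
    ]


# Keywords of a hypothetical original query, assumed to be stemmed already.
original_query = ["genetic", "algorithm", "retrieval"]
population = make_query_individuals(original_query, seed=0)
```

In this sketch every individual shares the keywords of the original query and differs only in its weights, which are the quantities the genetic algorithm would then modify on the training queries.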