SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
chapter
J. Yang
R. Korfhage
E. Rasmussen
National Institute of Standards and Technology
Donna K. Harman
DIS(D[OCRerr], Q) =11 D[OCRerr], Q II = (Xltik - qt[OCRerr][OCRerr])lIP
k
where k, (k = 1......, m), ranges over the set of query terms. Theoretically, the value of p can
be in the range [1, J. In the current experiments, p is set to 2. That is, the distance measure
is the Euclidean distance. Note that only terms which are present in the query are used in the
formula. This means if a term is present in the document but not in the query, the term is
assumed of no concern or not known by the user. In other words, a document will be
considered in the retrieval process only if it has at least one term in common with those in the
query.
However, variations exist in the actual implementation. The keyterms in documents
are unweighted in the [OCRerr]TREC databases we have used. That is, the elements in the document
vectors are in binary mode, 0 or 1. This was determined for two reasons. First, for a large
database of this kind, obtaining document term weights using general term weighting
methods, such as the inverse document frequency (e.g., Salton and Buckley, 1988) requires
counting term frequencies from all of the documents in the database, which is resource
intensive for a database of this size. Second, previous experiments using the Oranfield
database indicate that our system performance equally well with either weighted or
unweighted document terms.
Each query vector is generated from a topic provided in the TREC project. In general,
a topic includes several descriptive items: the sequential number of the topic, domain, title,
descriptive, narrative, concepts, factors, nationality, etc. The terms of a query vector are
derived from the title and the concepts of the topic. In some cases the nationality is added, if
it is important to meet the requirement of the narrative. The query vector term weights are
assigned by the system at the beginning of each run.
3. The Algorithm
The method we developed is based on the relevance feedback concept. However, the
algorithm underlying the process is totally different from those used in other relevance
feedback studies. Our system is also based on genetic algorithms which are known for their
efficiency and effectiveness in searching large problem spaces to find optimal solutions (e.g.,
Holland, 1975; Goldberg, 1989).
The basic concept behind genetic algorithms is that there exists a population of
individual variants (those individuals can be various types, depending on the actual problem to
be solved) in an environment, each trying to survive by exchanging information and competing
with others. The genetic algorithms use methods analogous to natural evolution mechanisms
to generate those individuals who provide the optimal or near optimal solution to the problem.
The adaptability of genetic algorithms is due to the fact that individual selection and
manipulation are adaptive to the existing environment. The environment evaluates the fitness
33