SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Query Improvement in INformation Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project
chapter
J. Yang
R. Korfhage
E. Rasmussen
National Institute of Standards and Technology
Donna K. Harman
Relevance feedback, although started more than twenty years ago, attracted more
research in recent years. With advances in computer hardware and software, the feedback
process, which requires more computing time and friendlier user interface than the traditional
method, will dominate the information retrieval area. However, feedback techniques also
need to be improved. They should include more intelligent mechanisms which embed the
ability to adapt to the changing environment and handle the feedback process automatically.
We have been developing an adaptive method using genetic algorithms to modify user
queries automatically. Our preliminary experiments based on several simple data collections
showed promising results (Yang and Korfhage, 1992a, 1992b). This algorithm was applied to
the TREC project. This paper reports the experiments and some of the results.
2e The Model
Our system is based on a vector space model. Within the model, both documents and
queries are represented as vectors. A document vector (D[OCRerr] ) with n keywords can be
represented as:
= (t[OCRerr]1, t[OCRerr]2[OCRerr] t[OCRerr]3 ,
where (, = 1......, n) is the assigned value of the 1h keyword. Either weighted or
unweighted document keywords can be used. For unweighted document keywords, tij is
either 0 or 1. For weighted document keywords, the value of tij is a real number in the range
[0,1].
In this model a query (Q) with m keywords is also represented in a vector format as:
Q = (qtj, qt2..., qt[OCRerr])
where qt[OCRerr](k= 1,..., m) is the kth keyword in the query, where the keywords are extracted from
the user natural language requests. However, no weights are assigned to query terms in the
initial stage; query term weights are assigned by the system during the processing to be
described later.
In general, under vector space models, document retrieval is based on a similarity
measurement between the query and documents. This means that documents with high
similarity to the query are judged more relevant to the query and should be retrieved first. In
our model, the similarity measure is based on a metric, the L[OCRerr] metric, which measures the
distance between document vectors and query vectors. This means that the shorter the
distance between a document and the query, the higher the similarity between them. The
distance value between a document D[OCRerr] and the query Q is calculated, using the Lp metric, by
the formula:
32