ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
A Modified Two-Level Search Algorithm Using Request Clustering
chapter
V. R. Lesser
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
v'I-2
full search technique must be discarded. Instead, a multi-level search
algorithm based on a partial scan of the document collection can be used.
Such a scheme can be based on a clustering algorithm which uses the
information content of the documents to partition the document collection
into subsets of related documents. The following procedure can then be
used to perform a multi-level search based on the previously identified
document clusters:
1) the procedure finds those subsets of documents whose represen-
tative centroid vectors* are significantly correlated with the
given query;
2) the query is then matched against each document contained in the
subsets of documents found in step 1).
The basic assumption in the multi-level search is that for each new
query introduced into the system, the documents which are relevant to this
query are contained in only a few of the document clusters. ?Llrther, the
centroid vectors of these particular clusters will correlate more highly
with the query than the centroid vectors of the document clusters which
do not contain any documents relevant to the query. Therefore, the
efficiency of the multi-level search is dependent on the number of clusters
which contain relevant documents, and on the correlation of the query with
the centroid vectors of.relevant document clusters.
*
(let each document be represented by an n- dimensional index vector;
consider a set D of document index vectors; the centroid vector c for
the set D is defined as: ln W[OCRerr]
n i=l dii
where D = ([OCRerr]d1,d2.......