Scientific Report No. ISR-11 Information Storage and Retrieval
A Modified Two-Level Search Algorithm Using Request Clustering
V. R. Lesser
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
D) Evaluation Results
The improvement in search efficiency by query clustering can be
observed in cases 1-6 in Table 1. In all of these cases, the search
efficiency measure indicates that the modified two-level search
based on query clustering is significantly better than the normal two-level
search scheme based on document clustering. The reasons for this improvement
in search efficiency can be explained by Table 2: the classification vectors
of the categories constructed by query clustering are more highly correlated
with the test queries, and they more naturally assign each test query to
one particular category. This is indicated by the large differences between
the first and second highest correlating classification vectors. These
results provide an experimental validation of the theoretical advantages of
query clustering as illustrated by Figure 1. Unfortunately, the other two
criteria, P_T and R_T, contradict the general feeling that the higher the
query-document correlations (and therefore the larger the value of P_T),
the greater the probability of retrieving relevant documents (and therefore
the larger the value of R_T). A positive conclusion based on all three
criteria for search effectiveness is thus impossible. Still, it is evident
that case 5, which is an example of the modified two-level search scheme,
is superior to the two examples of the normal two-level search scheme: the
values of two of the three criteria for case 5 are much better than for
case 1 and the other normal two-level case, and the differences in the
values of the third criterion for these three cases are small.
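
The evidence drawn from Table 2, namely the difference between the first and
second highest correlating classification vectors for a test query, can be
illustrated with a minimal sketch. It assumes that the correlation is the
cosine of the angle between term-weight vectors; the function names and the
toy vectors are hypothetical and are not taken from the report.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_two_margin(query, class_vectors):
    """Correlate a query with every classification vector.

    Returns (best_index, best_correlation, margin), where margin is the
    difference between the highest and second highest correlations.
    Requires at least two classification vectors.
    """
    if len(class_vectors) < 2:
        raise ValueError("need at least two classification vectors")
    ranked = sorted(
        ((cosine(query, c), i) for i, c in enumerate(class_vectors)),
        reverse=True,
    )
    (best, best_index), (second, _) = ranked[0], ranked[1]
    return best_index, best, best - second


# Toy data (assumed for illustration only): three categories, three terms.
query = [1.0, 0.0, 0.0]
class_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
best_index, best, margin = top_two_margin(query, class_vectors)
```

A large margin means that the query falls clearly into one category; a small
margin means that two categories compete for it, so that a search restricted
to the single best category is more likely to miss relevant documents.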
The apparent contradiction caused by differences between the values of
P_T and R_T can be resolved if the evaluation results are based only on
requests which retrieve more than three documents. It appears that for
requests which retrieve only 3 documents, high overlap between the first 3