ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Consideration for Time Shared Automatic Documentation Centers
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
document sets created by the partitioning. Documents might be included
in more than one cluster. This would, however, be slightly wasteful of
storage space.
To derive timing estimates, let us assume disjoint document clusters,
so that we have approximately n key vectors representing n clusters of
250,000/n documents each. We should expect that about two or three
clusters are searched per request, but for safety's sake, let us assume
that five clusters are searched per request. If we assume that five
correlations can be performed per millisecond, the total time required
for internal operations is
n/5 + 250,000/n msec or about 250/n seconds.
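As a rough check on this estimate, the arithmetic can be sketched in Python. The function name and parameter names below are illustrative and do not appear in the report; the default values (250,000 documents, five clusters searched, five correlations per millisecond) are the assumptions stated in the text.

```python
def internal_time_msec(n, total_docs=250_000, clusters_searched=5,
                       correlations_per_msec=5):
    """Estimate the internal (correlation) time in milliseconds for n clusters.

    Hypothetical helper: it only restates the report's arithmetic.
    """
    # Step 1: correlate the request with each of the n key vectors.
    key_vector_time = n / correlations_per_msec
    # Step 2: correlate the request with every document in the searched clusters.
    docs_per_cluster = total_docs / n
    document_time = clusters_searched * docs_per_cluster / correlations_per_msec
    return key_vector_time + document_time


if __name__ == "__main__":
    for n in (500, 1000):
        print(n, internal_time_msec(n), "msec")
```

For n = 500 this gives 600 msec, and for n = 1000 it gives 450 msec. The key vector term n/5 is therefore small, though not negligible, next to the rounded figure of 250/n seconds.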
The time required for external operations is the data cell access time
(about 0.5 second) and read time for each cluster. Since each cluster contains
250,000/n documents and each document consists of 2000 bits, processed
at 7 x 10^5 bits per second, each cluster will require about 750/n seconds
to read in. If five clusters are to be read, and one expects an average
of two in each data cell, the total read time would be 2(0.5 + 750/n) seconds.
Reasonable values for n would thus be n = 500 or n = 1000 which would
allow the complete search to be performed in 3 to 5 seconds. Considering
the small amount of additional work (sorting the correlations and
applying the cutoff and other restrictions), it is clear that 10 seconds
should suffice for the complete process, and that five seconds would be
a more likely bound.
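The overall estimate can be reproduced with a short sketch. It reads the external term as 2(0.5 + 750/n) seconds and the internal term as the rounded 250/n seconds. The function names are hypothetical, and the sketch restates the report's figures rather than modelling the data cell hardware.

```python
def internal_time_sec(n):
    """Rounded internal correlation time from the text: about 250/n seconds."""
    return 250.0 / n


def external_time_sec(n, access_sec=0.5, read_const_sec=750.0, accesses=2):
    """External time: `accesses` data cell accesses, each costing the access
    time plus one cluster read of about 750/n seconds (the report's expression)."""
    return accesses * (access_sec + read_const_sec / n)


def total_search_time_sec(n):
    """Sum of internal and external estimates, before sorting and cutoff work."""
    return internal_time_sec(n) + external_time_sec(n)


if __name__ == "__main__":
    for n in (500, 1000):
        print(n, total_search_time_sec(n), "seconds")
```

With these figures the sketch gives 4.5 seconds for n = 500 and 2.75 seconds for n = 1000, in line with the 3 to 5 seconds quoted above.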
This assumes that no competition exists from other programs in memory
for the data cell sections needed by SMART. Since the data cell sections