Scientific Report No. IRS-13 Information Storage and Retrieval
Test Environment
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Collection   Average    Collection   Total Word   Total Non-     b/a      Unique        c/b
             Document   Size         Occur.       Common Word             Non-Common
             Length                  (a)          Occur. (b)              Words (c)

ADI Abs.     35         82           4,872        2861           58.7%    1321          46.2%
IRE-1        44         405          31,663       17,729         56.0%    4041          22.8%
IRE-3        49         780          68,947       38,572         55.9%    5477*         14.2%
IRE-2        56         375          37,284       20,843         55.9%    3751          18.0%
ISPRA        58         1268         131,491      73,410         55.8%    7980          10.9%
Medlars      80         276          38,958       22,023         56.5%    5331          24.2%
CRAN-2       91*        1400         231,294*     127,813*       55.3%    8887*         7.0%
CRAN-1       91         200          33,042       18,259         55.3%    3123          17.1%
ADI Txt      710        82           113,130      58,190         51.4%    7925          13.6%
* Estimated
Comparison of Word Occurrence Statistics of the
English Text in Nine Collections
Fig. 21
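
The two ratio columns of Fig. 21 follow directly from the counts labelled (a), (b) and (c). The short Python sketch below recomputes b/a and c/b for five rows of the figure whose printed values are legible, as a check on the arithmetic. The names ROWS, ratios, table_ratios and average_length are illustrative only and do not come from the report. The last function rests on an assumption inferred from the numbers rather than stated in the figure: that the Average Document Length column equals the non-common word occurrences (b) divided by the collection size.

```python
# Rows copied from Fig. 21: (collection, documents, a, b, c), where
#   a = total word occurrences
#   b = total non-common word occurrences
#   c = unique non-common words
ROWS = [
    ("IRE-2", 375, 37284, 20843, 3751),
    ("ISPRA", 1268, 131491, 73410, 7980),
    ("Medlars", 276, 38958, 22023, 5331),
    ("CRAN-1", 200, 33042, 18259, 3123),
    ("ADI Txt", 82, 113130, 58190, 7925),
]


def ratios(a, b, c):
    """Return (b/a, c/b) as percentages rounded to one decimal place."""
    return round(100 * b / a, 1), round(100 * c / b, 1)


def table_ratios(rows=ROWS):
    """Map each collection name to its (b/a, c/b) percentage pair."""
    return {name: ratios(a, b, c) for name, _docs, a, b, c in rows}


def average_length(docs, b):
    """ASSUMPTION (not stated in the figure): average document length
    is the number of non-common word occurrences per document."""
    return round(b / docs)


if __name__ == "__main__":
    for name, docs, a, b, c in ROWS:
        b_over_a, c_over_b = ratios(a, b, c)
        print(f"{name:8s} b/a={b_over_a}%  c/b={c_over_b}%  "
              f"avg length={average_length(docs, b)}")
```

For these rows the recomputed percentages agree with the printed columns, and the assumed length formula reproduces the printed average lengths. The same check may be used to resolve digits that are illegible in other rows.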