Scientific Report No. IRS-13
Information Storage and Retrieval
Test Environment chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Fig. 21 compares nine collections on the basis of word occurrences. A standard list of 204 common words is used in each case to isolate the total non-common words and the total unique non-common words. It may be noted that in seven of the collections, the proportion of non-common to total word occurrences is between 55.3% and 56.5%; even the two ADI collections are not far outside this range. The proportion of unique (or distinct) non-common words to total non-common word occurrences varies both with document length and collection size. For example, if the collections are divided into the six having 82-405 documents, and the three having 780-1400 documents, the unique-to-total proportion (c/b) varies directly with average document length within the two groups. The one small exception is the Medlars collection, but the abundance of technical names in medicine may be the cause. Although further analysis could be done, the data in Fig. 21 suggest that the common factor of English text provides strong uniformity in the statistics given, irrespective of subject area. This does not directly confirm or reject the subject language precision ideas, since ambiguity is not reflected in any of the statistics given.

A retrieval performance plot comparing results from three collections is given in Fig. 22. The type of dictionary used is the automatic stem procedure, since use of thesaurus dictionaries would introduce the additional element of varying human skills in thesaurus construction. Furthermore, many variables exist due to request preparation and relevance decisions between the collections; the extent to which these variables affect the result is not known.
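As an illustration of the word-occurrence statistics compared in Fig. 21, the following is a minimal sketch of how the counts might be computed for a collection. It rests on several assumptions: the short `COMMON_WORDS` set is a hypothetical stand-in for the 204-word common word list, which is not reproduced here; the tokenization rule (runs of letters, lower-cased) is chosen for convenience; and the label `a` for total word occurrences is introduced for illustration, while `b` and `c` follow the (c/b) notation in the text.

```python
import re

# Hypothetical stand-in for the report's 204-word common word list,
# which is not reproduced in the text.
COMMON_WORDS = frozenset(
    {"the", "of", "a", "an", "and", "in", "is", "to", "for", "on"}
)


def word_statistics(documents, common_words=COMMON_WORDS):
    """Compute word-occurrence statistics for a collection of document texts.

    a = total word occurrences (label assumed for illustration)
    b = total non-common word occurrences
    c = unique (distinct) non-common words
    """
    # Assumed tokenization: lower-case runs of letters.
    tokens = [
        word
        for doc in documents
        for word in re.findall(r"[a-z]+", doc.lower())
    ]
    non_common = [word for word in tokens if word not in common_words]

    a = len(tokens)
    b = len(non_common)
    c = len(set(non_common))
    return {
        "total": a,
        "non_common": b,
        "unique_non_common": c,
        # Proportion of non-common to total word occurrences (b/a).
        "non_common_share": b / a if a else 0.0,
        # Proportion of unique to total non-common occurrences (c/b).
        "unique_share": c / b if b else 0.0,
    }


# Toy collection, invented purely to exercise the function.
docs = [
    "The retrieval of documents in a collection",
    "Retrieval tests on the collection",
]
stats = word_statistics(docs)
```

Running the function over each collection and comparing `non_common_share` and `unique_share` side by side would give a table of the same general shape as the comparison described above, though the actual figures depend on the real common word list and on the tokenization used.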
It can be suggested, however, that the superiority of the computer science collection and the inferiority of the