IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Test Environment
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
1-42
Information is given in Fig. 21 comparing nine collections on the
basis of word occurrences. A standard list of 204 common words is used
in each case to isolate the total non-common words and total unique non-
common words. It may be noted that in seven of the collections, the pro-
portion of non-common to total word occurrences is between 55.3% and 56.5%;
even the two ADI collections are not far outside this range. The proportion
of unique (or distinct) non-common words to total non-common word occurrences
varies both with document length and collection size. For example, if the
collections are divided into the six having 82-405 documents, and the
three having 780-1400 documents, the unique-to-total proportion (c/b) varies
directly with average document length within the two groups. The one small
exception is the Medlars collection, but the abundance of technical names
in medicine may be the cause. Although further analysis could be done, the
data in Fig. 21 suggest that the common factor of English text provides
strong uniformity in the statistics given, irrespective of subject area.
This does not directly confirm or reject the subject language precision
ideas, since ambiguity is not reflected in any of the statistics given.
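
The counts behind these proportions can be illustrated with a minimal sketch. It is not the procedure used to produce Fig. 21: the tokenizer, the short common-word list (a hypothetical stand-in for the standard list of 204 common words), and the two sample documents are assumptions made for illustration only. The letters b and c follow the c/b notation used above; a, for total word occurrences, is an assumed label.

```python
import re

# Hypothetical stand-in for the report's standard list of 204 common words.
COMMON_WORDS = {"the", "of", "a", "in", "is", "and", "to", "for"}


def word_statistics(documents, common_words=COMMON_WORDS):
    """Count word occurrences and form the two proportions discussed above."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    tokens = [w for doc in documents for w in re.findall(r"[a-z]+", doc.lower())]
    non_common = [w for w in tokens if w not in common_words]
    a = len(tokens)           # total word occurrences (label "a" is assumed)
    b = len(non_common)       # total non-common word occurrences
    c = len(set(non_common))  # unique (distinct) non-common words
    return {
        "total": a,
        "non_common": b,
        "unique_non_common": c,
        "non_common_ratio": b / a if a else 0.0,  # b/a
        "unique_ratio": c / b if b else 0.0,      # c/b
    }


# Invented sample text, not drawn from any of the nine collections.
docs = [
    "The retrieval of information is a problem in the library.",
    "Information retrieval and the storage of documents.",
]
stats = word_statistics(docs)
```

On a real collection the same function would be applied to the full document texts with the full common-word list; the ratios obtained from this toy input say nothing about the values in Fig. 21.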
A retrieval performance plot comparing results from three collections
is given in Fig. 22. The type of dictionary used is the automatic
stem procedure, since use of thesaurus dictionaries would introduce the
additional element of varying human skills in thesaurus construction.
Furthermore, many variables exist due to request preparation and relevance
decisions between the collections; the extent to which these variables
affect the result is not known. It can be suggested, however, that the
superiority of the computer science collection and the inferiority of the