NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Appendix B: Progress and Prospects in Mechanized Indexing
Mary Elizabeth Stevens
National Bureau of Standards

The automatic indexing, selective dissemination and retrieval system design developed by Ossorio 43/ is based on a system vocabulary subsequently used for the automatic assignment of new items to appropriate locations in a pre-established "classification space". An "attribute space" may also be developed to identify the kind of information found in a document, e.g., that it deals with concepts such as weight or physical size rather than with mathematical or space and time concepts. Both types of "space" in this system are constructed through the use of factor analysis applied to previously established relationships between the terms in the system vocabulary (approximately 1,450 terms) and 49 subject fields and to relevance ratings of attributes with respect to items. Then, "documents are indexed by being assigned a set of coordinates in the classification space by means of the Classification Formula and the system vocabulary."

With respect to the use of linguistic techniques in automatic indexing and classification, methods of computational linguistics may be used to derive measures of the probable significance of words in document texts. Damerau 34/ reports experimentation with word subset selection for indexing purposes based upon word occurrence frequencies significantly larger than expected frequencies (following Edmundson and Wyllys, in part), with encouraging results. Findings by Black 13/, Simmons et al. 44/, Spiegel and Bennett 38/, and Wallace 45/, among others, suggest the need for continuing investigations in the area of proper discrimination between significant clue words and non-informing words for a particular corpus or collection.
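The idea of selecting words whose occurrence frequencies are significantly larger than expected can be illustrated with a minimal sketch. The report does not state the statistic used by Damerau, so the binomial z-score below, the function name `select_index_words`, the threshold of 2.0, and the fallback frequency for unseen words are all assumptions made only for illustration.

```python
import math
from collections import Counter


def select_index_words(doc_tokens, corpus_rel_freq, z_threshold=2.0,
                       default_rel_freq=1e-4):
    """Return {word: z} for words over-represented in one document.

    corpus_rel_freq maps a word to its relative frequency in a reference
    collection (an assumed input; the source does not specify one).
    The z-score uses a binomial approximation, which is an illustrative
    choice and not necessarily the test used in the cited work.
    """
    n = len(doc_tokens)
    counts = Counter(doc_tokens)
    selected = {}
    for word, observed in counts.items():
        p = corpus_rel_freq.get(word, default_rel_freq)
        if p <= 0.0 or p >= 1.0:
            continue  # no meaningful expectation for this word
        expected = n * p
        std = math.sqrt(n * p * (1.0 - p))
        z = (observed - expected) / std
        if z >= z_threshold:
            selected[word] = z
    return selected


# Toy document: two function words at their expected rate, one content
# word far above its expected rate.
tokens = ["the"] * 10 + ["of"] * 5 + ["indexing"] * 5
reference = {"the": 0.5, "of": 0.25, "indexing": 0.01}
chosen = select_index_words(tokens, reference)
print(sorted(chosen))
```

In this toy case the function words occur exactly as often as the reference collection predicts and are not selected, while the content word is. The choice of reference collection matters, which is consistent with the text's point that discrimination must be judged for a particular corpus.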
Extensive computer processing and analyses such as Dennis 46/ has applied to the legal literature are needed for other subject matter fields. The latter investigator warns that neither raw word frequencies nor the numbers of documents in which a word occurs provide good criteria for distinguishing between trivial or non-informing and significant or informing words. She suggests, instead, that "discrimination increases with the skewness of the word distribution in the file".

Baxendale has suggested that certain types of phrase structures and nominal constructions, as determined by relatively unsophisticated machine syntactic analyses, are useful in revealing appropriate subject-content clues. A recent example is provided by Clarke and Wall 47/: "The hypothesis is that the importance of nominal constructions in selection of index unit candidates places emphasis on the bracketing of all noun phrases." Baxendale's continuing work 48/ further suggests that "through the methods of statistical decision theory it is hoped to formulate quantitative measures that will separate informative index terms from noninformative." Continuing use of syntactic analysis principles is provided as an option in the SMART system (Salton 49/) and possibilities for choosing index terms automatically by syntactic criteria have been explored by Dolby et al. 35/.

Closely related to automatic classification or indexing experiments involving linguistic factors are document and word grouping investigations for homograph resolution and subject field identification purposes, such as those of Doyle 50/ and Wallace 45/. Doyle used a Fortran computer program developed by Ward and Hook for iterative automatic groupings of 50 physics and 50 non-physics documents. He was able to show clear-cut separation of two meanings of words such as "force" and "satellite". A case involving overlaps of word memberships in more than one subject class has been investigated by Wallace 45/.
Using word frequency data, he found 48 words in common among the first 100 word-frequency rankings for psychological and computer literature abstracts, with function words predominating. However, using a word rank sum criterion, he was able to separate 50 psychological abstracts from 50 computer abstracts with 78 percent success.
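A rank-sum separation of this general kind might be sketched as follows. The report does not describe how Wallace computed or compared rank sums, so everything here is an assumption for illustration: the function names, the penalty rank assigned to words missing from a class's list, and the rule of assigning an abstract to the class with the smaller sum.

```python
from collections import Counter


def frequency_ranks(documents, top_n=100):
    """Map each of the top_n most frequent words to its rank (1 = most frequent)."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(top_n), start=1)}


def rank_sum(doc, ranks, missing_rank):
    """Sum the ranks of a document's words; unlisted words get missing_rank."""
    return sum(ranks.get(word, missing_rank) for word in doc)


def classify(doc, ranks_a, ranks_b, label_a="A", label_b="B", top_n=100):
    """Assign doc to the class whose ranking gives the smaller rank sum.

    The penalty top_n + 1 for unlisted words is a hypothetical choice.
    """
    missing = top_n + 1
    sum_a = rank_sum(doc, ranks_a, missing)
    sum_b = rank_sum(doc, ranks_b, missing)
    return label_a if sum_a <= sum_b else label_b


# Toy abstracts, already tokenized (invented data, not from the study).
psych = [["subject", "learning", "the", "response"],
         ["learning", "the", "stimulus"]]
comp = [["program", "the", "memory"],
        ["program", "computer", "the"]]

psych_ranks = frequency_ranks(psych)
comp_ranks = frequency_ranks(comp)

# Words appearing in both rankings, analogous to the overlap noted in the text.
shared = set(psych_ranks) & set(comp_ranks)
print(shared)
print(classify(["learning", "the", "response"], psych_ranks, comp_ranks,
               "psych", "comp"))
```

Words shared by both rankings, typically function words, contribute similar amounts to both sums, so the decision is driven mainly by the words that rank high in only one class.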