Scientific Report No. ISR-11
Information Storage and Retrieval
Information Analysis and Dictionary Construction (chapter)
G. Salton, M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Consider, as an example of the kind of analysis normally necessary for dictionary construction, concept number 101, which represents the notion of "tag". The word list attached to this concept originally included terms such as "call", "designate", "identify", "identifier", "identification", "index", "indicate", "label", "mark", "name", "point", "signal", "sign", "subscript", and "tag". The concept occurred in [OCRerr] documents out of some [OCRerr]00, with the following distribution of significant terms:

Term              Frequency    Number of Documents
index                 17                7
signal (pulse)        20
identify               6               1+

All other terms under concept 101 occurred a total of 91 times, accounted for almost exclusively by the terms "pointed out", "indicated", and "call".

As a result of the analysis, the following changes were made:

the words "indicate", "call", "name", and "designate" were removed from category 101 and included in the list of common words;

the words "sign" and "signal" were also removed from category 101, since they seemed to occur in the document collection only in the sense of "pulse signal" and therefore not in the sense of "tag";

words with the stem "identi", accounting for "identifier", "identification", "identify", etc., were moved to a new concept number representing the idea of recognition.

At the end, only the terms "index", "label", "subscript", and "tag" remained under category 101.

Performance figures which measure the efficacy of various types of dictionaries are given later in this report. Several methods of semi-