ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Consider, as an example of the kind of analysis normally necessary
for dictionary construction, concept number 101, which represents
the notion of "tag". The word list attached to this concept originally
included terms such as "call", "designate", "identify", "identifier",
"identification", "index", "indicate", "label", "mark", "name", "point",
"signal", "sign", "subscript", and "tag". The concept occurred in [OCRerr]
documents out of some [OCRerr]00, with the following distribution of significant
terms:
Term              Frequency    Number of Documents
index                 17               7
signal (pulse)        20
identify               6               1+
All other terms under concept 101 occurred a total of 91 times, accounted
for almost exclusively by the terms "pointed out", "indicated", and "call".
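The two columns of such a distribution (total frequency of a term and the number of documents containing it) can be tallied as in the following minimal Python sketch. The three documents are invented for illustration and are not taken from the report's collection, the function name term_statistics is hypothetical, and words are matched as whole lowercase tokens without the stemming the report's dictionaries rely on.

```python
from collections import Counter

# Toy documents, invented for illustration only (not from the report).
documents = [
    "the index register holds the index of the array",
    "a pulse signal triggers the index counter",
    "each label is a tag",
]

# A few of the terms listed under concept 101 in the text.
concept_terms = {"index", "signal", "label", "tag"}


def term_statistics(documents, terms):
    """Return {term: (total frequency, number of documents containing it)}."""
    frequency = Counter()   # total occurrences across all documents
    doc_count = Counter()   # number of documents with at least one occurrence
    for text in documents:
        counts = Counter(w for w in text.lower().split() if w in terms)
        frequency.update(counts)
        doc_count.update(counts.keys())
    return {t: (frequency[t], doc_count[t]) for t in sorted(terms)}


stats = term_statistics(documents, concept_terms)
for term, (freq, ndocs) in stats.items():
    print(f"{term:<10} {freq:>3} {ndocs:>3}")
```

A term with a high total frequency concentrated in few documents, or one used only in an unrelated sense, is the kind of entry such a table is meant to expose.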
As a result of the analysis, the words "indicate", "call", "name", and
"designate" were removed from category 101 and were included in the list
of common words; the words "sign" and "signal" were also removed from
category 101, since they seemed to occur in the document collection only
in the sense of "pulse signal" and therefore not in the sense of "tag";
words with stem "identi", accounting for "identifier", "identification",
"identify", etc., were moved to a new concept number representing the idea
of recognition. At the end only the terms "index", "label", "subscript"
and "tag" remained under category 101.
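The revision just described can be followed step by step in a small Python sketch. The data structure (a mapping from concept number to a set of terms) and the key used for the new recognition concept are assumptions made for illustration; the report does not give the new concept number, and it does not say what became of "mark" and "point", so the last step is a guess that is labeled as such.

```python
# Concept 101 ("tag") as first listed in the text. "designate" is included
# because the text names it among the words later removed from the category.
thesaurus = {
    101: {
        "call", "designate", "identify", "identifier", "identification",
        "index", "indicate", "label", "mark", "name", "point",
        "signal", "sign", "subscript", "tag",
    },
}
common_words = set()

# Placeholder key: the report does not give the number of the new concept.
RECOGNITION = "recognition"

terms = thesaurus[101]

# 1. Words too general to be useful move to the common word list.
too_general = {"indicate", "call", "name", "designate"}
common_words.update(terms & too_general)
terms.difference_update(too_general)

# 2. "sign" and "signal" occur only in the "pulse signal" sense: drop them.
terms.difference_update({"sign", "signal"})

# 3. Words with stem "identi" move to a new concept for recognition.
identi_words = {t for t in terms if t.startswith("identi")}
thesaurus[RECOGNITION] = identi_words
terms.difference_update(identi_words)

# 4. ASSUMPTION: the text does not say what became of "mark" and "point";
#    they are dropped here only so the result matches the reported final list.
terms.difference_update({"mark", "point"})

print(sorted(thesaurus[101]))
```

Without step 4, "mark" and "point" would still be listed under concept 101, which is why that step is flagged as an assumption and not as part of the reported procedure.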
Performance figures which measure the efficacy of various types of
dictionaries are given later in this report. Several methods of semi-