IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
An Experiment in Automatic Thesaurus Construction
chapter
R. T. Dattola
D. M. Murray
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
VIII-4
documents in which both occur. Then the following correlation functions may
be used to compute S:
1) cosine: [OCRerr] = N
2) Tanimoto: S..
N
L.+L.-N
1
3) overlap: [OCRerr] = N
Min(L.,L.)
1
By applying a cut-off value k to the elements of S, the similarity
matrix may be interpreted as a binary connection matrix for a graph G.
The concepts form the node vector, and node i (concept i) is connected to
node j (concept j) if and only if S.. > k.
The initial concept classes of the subcollection are specified by
the maximal complete subgraphs of G. E6] Given the node vector, the degree
vector of these nodes, and the corresponding connection matrix, the algor-
ithm for finding the maxima complete subgraphs of G is as follows:
1) pick the first node i which does not occur in any previous
initial' class and use it as the start of a new initial class;
2) for each node j connected to node i,
a) test if the degree of node j is greater than the number of
nodes in the new initial class; and
b) test if node j is connected to all other nodes in the new
class;