ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
On Some Clustering Techniques for Information Retrieval
chapter
J. D. Broffitt
H. L. Morgan
J. V. Soden
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
2. Similarity Measures
In the present system, documents and queries are represented as
vectors in n-dimensional Euclidean space, where n is the number of
allowable concepts or index terms in the system. Documents are retrieved
on the basis of their closeness to the query vector, with "closeness"
meaning small Euclidean distance between the vectors. Since the document
vectors are of varying length, however, perpendicular distance at some
fixed distance from the origin may not always be a good measure.
Normalization of the document vectors so that their endpoints lie on the
unit hypersphere, and use of the arc length along the hypersphere as a
distance measure, removes this problem. The measures used in the present
study are functions of this arc length through the cosine of the angle
between two document vectors. The measures used are defined in the
following ways:
(1) Cosine measure

\[ S_{d_1 d_2} = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} \]

(2) Tanimoto's measure [4]

\[ S_{d_1 d_2} = \frac{d_1 \cdot d_2}{d_1 \cdot d_1 + d_2 \cdot d_2 - d_1 \cdot d_2} \]

where \(d_1\) and \(d_2\) are document vectors and \(S_{d_1 d_2}\) is the similarity of
document one with document two.
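
The following short derivation is an added illustration, not part of the original report. It sketches why both measures can be read as functions of the arc length when the document vectors have been normalized to unit length. The symbol \(\theta\) (the angle between the two vectors) and the superscripts distinguishing the two measures are introduced here for convenience only.

```latex
% Assumption: |d_1| = |d_2| = 1, so d_1 \cdot d_1 = d_2 \cdot d_2 = 1
% and d_1 \cdot d_2 = \cos\theta.
\begin{aligned}
S^{\cos}_{d_1 d_2} &= \cos\theta,
  \qquad \text{arc length on the unit hypersphere} = \theta = \arccos S^{\cos}_{d_1 d_2}, \\
S^{\mathrm{Tan}}_{d_1 d_2} &= \frac{\cos\theta}{1 + 1 - \cos\theta}
  = \frac{\cos\theta}{2 - \cos\theta}.
\end{aligned}
```

Under this assumption the Tanimoto value increases monotonically with \(\cos\theta\), so the two measures rank unit-length documents in the same order. For vectors that are not normalized, the two measures can differ, since measure (2) also depends on the vector lengths.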
Bonner uses the coefficient (2) to form his document-document similarity
matrices, while Rocchio uses the cosine measure to compare documents.
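
As a minimal sketch of the two measures defined above, the following Python fragment computes both for a pair of term vectors. It is an illustration only and not the code used in the present system; the function names `dot`, `cosine_similarity`, and `tanimoto_similarity`, and the sample vectors, are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length term vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of terms")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Measure (1): d1.d2 / (|d1| |d2|)."""
    norm_product = math.sqrt(dot(d1, d1)) * math.sqrt(dot(d2, d2))
    if norm_product == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(d1, d2) / norm_product


def tanimoto_similarity(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Measure (2): d1.d2 / (d1.d1 + d2.d2 - d1.d2)."""
    cross = dot(d1, d2)
    denominator = dot(d1, d1) + dot(d2, d2) - cross
    if denominator == 0:
        raise ValueError("Tanimoto measure is undefined for two zero vectors")
    return cross / denominator


if __name__ == "__main__":
    # Hypothetical documents over four index terms (binary weights).
    doc_a = [1, 1, 0, 0]
    doc_b = [1, 0, 1, 0]
    print("cosine  :", cosine_similarity(doc_a, doc_b))
    print("Tanimoto:", tanimoto_similarity(doc_a, doc_b))
```

For these two sample vectors, which share one of their two terms each, the cosine works out to 1/2 and the Tanimoto coefficient to 1/3, which shows that the two measures can assign different values to the same pair when the vectors are not normalized.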