Scientific Report No. ISR-11
Information Storage and Retrieval

On Some Clustering Techniques for Information Retrieval

J. D. Broffitt, H. L. Morgan, J. V. Soden
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

2. Similarity Measures

In the present system, documents and queries are represented as vectors in n-dimensional Euclidean space, where n is the number of allowable concepts or index terms in the system. Documents are retrieved on the basis of their closeness to the query vector, with "closeness" meaning small Euclidean distance between the vectors. Since the document vectors are of varying length, however, perpendicular distance at some fixed distance from the origin may not always be a good measure. Normalization of the document vectors so that their endpoints lie on the unit hypersphere, and use of the arc length along the hypersphere as a distance measure, removes this problem.

The measures used in the present study are functions of this arc length through the cosine of the angle between two document vectors. The measures used are defined in the following ways:

(1) Cosine measure

$$ S_{d_1 d_2} = \frac{d_1 \cdot d_2}{|d_1| \, |d_2|} $$

(2) Tanimoto's measure [4]

$$ S_{d_1 d_2} = \frac{d_1 \cdot d_2}{d_1 \cdot d_1 + d_2 \cdot d_2 - d_1 \cdot d_2} $$

where $d_1$ and $d_2$ are document vectors and $S_{d_1 d_2}$ is the similarity of document one with document two. Bonner uses the coefficient (2) to form his document-document similarity matrices, while Rocchio uses the cosine measure to compare documents.
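The two measures defined above, and the arc-length distance they derive from, can be illustrated with a minimal Python sketch. It is not taken from the report: the function names, the plain-list vector representation, the zero-vector convention, and the sample vectors are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length concept vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(d1, d2):
    """Measure (1): d1.d2 / (|d1| |d2|)."""
    norm = math.sqrt(dot(d1, d1)) * math.sqrt(dot(d2, d2))
    # Assumption: a zero vector is treated as having similarity 0.
    return dot(d1, d2) / norm if norm else 0.0


def tanimoto_similarity(d1, d2):
    """Measure (2): d1.d2 / (d1.d1 + d2.d2 - d1.d2)."""
    p = dot(d1, d2)
    denom = dot(d1, d1) + dot(d2, d2) - p
    # Assumption: two zero vectors are treated as having similarity 0.
    return p / denom if denom else 0.0


def arc_length(d1, d2):
    """Arc length on the unit hypersphere between the normalized vectors."""
    c = cosine_similarity(d1, d2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, c)))


# Hypothetical document vectors over four index terms.
d1 = [1, 0, 2, 1]
d2 = [1, 1, 1, 0]

print(cosine_similarity(d1, d2))
print(tanimoto_similarity(d1, d2))
print(arc_length(d1, d2))
```

For these sample vectors the cosine measure is about 0.707 and Tanimoto's measure is 0.5. Note that when both vectors are first normalized to unit length, measure (2) reduces to c / (2 - c), where c is the cosine, which is why both measures can be regarded as functions of the arc length.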