Scientific Report No. IRS-13
Information Storage and Retrieval
Correlation Measures
K. Reitsma, J. Sagalyn
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

concept number one, the third position concept three, etc.). Thus for binary vectors, if a 0 occurs in the third position, concept three is absent from that document, and if a 1 occurs in the third position, concept three is present.

The second type of vector is a weighted vector. Ideally, the positions in the vector have the same interpretation as for binary vectors. The difference is that the value in each position is 0 if the concept is not found in the document, or some integer j > 0 proportional to the number of times the concept appears in the document. In the SMART system a weight of 12 is given to concept k if concept k occurs once in the document, 24 if k occurs twice, etc. Since approximately 600 concepts occur in the thesaurus, each document description vector would normally have a length of 600 positions. To reduce the memory space needed, only those concepts with non-zero weights are retained in the vector, the concept number and weight both being packed into the same memory location.

The use of a correlation coefficient in the two systems poses some problems. A coefficient defined for binary vectors may have a specific interpretation, either logical or statistical. However, the same coefficient used with weighted vectors may lose its former interpretation. For example, given the two vectors

    v = (1,1,0,1,0,1,0)
    w = (1,0,1,1,0,1,1)

the expression

    \sum_{i} v_i w_i        (1)

is interpreted as the number of matching terms or the number of concepts