ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Figure 4.2 illustrates this process for two Boolean queries and the
collection described in Figure 4.1(a).
Another of the operand structures useful for document and
query representations is the n-dimensional cartesian vector. Table
4.2 characterizes some of the vector comparison operations of interest.
Equality, as in the case of set-represented operands, is too
restrictive a criterion for selecting source documents in response
to an input query. The vector difference assigns a vector quantity,
rather than a scalar, to each query-document pair, although its
magnitude could be a useful matching criterion. In most cases, however, and particularly in the
case of the index images derived by a frequency counting technique
(see Chapter 2), the information in the vector image of interest is
contained in the relative magnitude of its components rather than in
their absolute magnitudes. This results from the direct dependence
of the absolute magnitude on the number of words in the input text.
With this assumption, the angular distance function provides the most
suitable matching operation for vector-structured information
representations.
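The contrast between the two criteria can be made concrete with a minimal sketch in Python. The function names and the three-component vectors below are illustrative assumptions, not taken from Table 4.2; the component values stand in for word-frequency counts. The sketch shows that the magnitude of the vector difference grows with the length of the input text, while the angle between the vectors depends only on the relative magnitudes of their components.

```python
import math


def difference_magnitude(q, d):
    """Euclidean length of the vector difference q - d."""
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))


def angular_distance(q, d):
    """Angle in radians between q and d; it depends only on direction."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_q * norm_d)))
    return math.acos(cosine)


# Hypothetical frequency-count vectors (assumed data for illustration).
query = [1, 2, 0]
short_doc = [2, 4, 0]    # same term proportions as the query
long_doc = [20, 40, 0]   # same proportions, ten times as many words
other_doc = [0, 0, 3]    # shares no terms with the query

for name, doc in [("short", short_doc), ("long", long_doc), ("other", other_doc)]:
    print(name, difference_magnitude(query, doc), angular_distance(query, doc))
```

Under these assumed values, the short and the long document lie at essentially the same (near-zero) angle from the query, although their difference magnitudes differ by a large factor; the document with no shared terms lies at a right angle.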
Data representations with structures considerably more
complex than set or vector operands have also been considered for
automatic document retrieval systems. Hierarchical arrays,2 tree
structures,3 and abstract graphs4 are among these. With information
representations of these types, matching operations are considerably
more complex than those described above (see for example Sussenguth,
reference 4, for a detailed account of graph matching procedures).
The price paid, then, for the additional information which can be