ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
detail, for each query the following parameters could be produced:
1.) The total number of documents in the union of the
retrieved categories.
2.) The overlap correlation of the category retrieved subset
with the first 15 and first 30 documents retrieved by a
full search. (The overlap correlation between sets A
and B is defined by n(A ∩ B)/minimum(n(A),n(B)).)
3.) The category recall or percentage of relevant documents
in the category retrieved subset to the total number of
relevant documents.
4.) The normal recall or percentage of relevant documents
retrieved to the total number of relevant documents,
assuming the same total number of documents retrieved as
contained in the category retrieved subset.
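The four parameters above can be illustrated with a minimal sketch. It is not the report's own program; the function names, the argument layout (one set of document identifiers per retrieved category, a full-search ranking in decreasing correlation order, and a set of relevant documents), and the choice to report recall as a fraction rather than a percentage are all assumptions made for illustration.

```python
def overlap_correlation(a, b):
    """Overlap correlation n(A ∩ B) / minimum(n(A), n(B)).

    Returns 0.0 when either set is empty (an assumption; the text
    does not define this case).
    """
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def category_search_parameters(category_docs, full_ranking, relevant):
    """Compute the four per-query parameters for a category search.

    category_docs: iterable of sets, one per retrieved category (hypothetical layout)
    full_ranking:  document ids ordered by decreasing correlation in a full search
    relevant:      set of document ids judged relevant to the query
    """
    # Union of the documents in all retrieved categories.
    subset = set().union(*category_docs)
    relevant = set(relevant)
    n = len(subset)

    def recall(retrieved):
        # Fraction of the relevant documents that were retrieved.
        if not relevant:
            return 0.0
        return len(set(retrieved) & relevant) / len(relevant)

    return {
        "total_documents": n,
        "overlap_first_15": overlap_correlation(subset, full_ranking[:15]),
        "overlap_first_30": overlap_correlation(subset, full_ranking[:30]),
        "category_recall": recall(subset),
        # Full search cut off at the same number of documents as the subset.
        "normal_recall": recall(full_ranking[:n]),
    }


params = category_search_parameters(
    category_docs=[{1, 2, 3}, {3, 4}],
    full_ranking=[2, 5, 1, 6, 7, 8],
    relevant={1, 5, 9},
)
```

Comparing "category_recall" with "normal_recall" for the same query shows how much recall is lost, or gained, by restricting the search to the retrieved categories while holding the number of retrieved documents fixed.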
It should be noted that this method of evaluating the
classification based search is somewhat unfair on two counts. First,
it does not consider the correlation distribution of the search
requests with the category vectors. Thus when a query has high
correlation with only one or two category vectors, only these should
be searched. Some queries, however, will not correlate very well with
any of the category vectors; and, in this case, one should expect to
have to search a larger number of categories in detail to do as well
as a full search. Queries of this latter type in effect do not fit
the classification structure. Second, the degree of association
between each classification vector and the documents it represents
(as reflected by Figure 4.10) is sufficiently small such that a wide