ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
4-.' ([OCRerr]
:[OCRerr]o v..::.aliz[OCRerr] tKe e::[OCRerr]ec[OCRerr][OCRerr] [OCRerr]i tor'[OCRerr].[OCRerr]-; or.[OCRerr][OCRerr]za[OCRerr]io[OCRerr] on [OCRerr]
docu::ient retri[OCRerr]ial oDeralon so::[OCRerr]e [OCRerr]e:i[OCRerr][OCRerr]t[OCRerr]onz are recuire[OCRerr]. In the base
r[OCRerr]trieval 3ystem (£ull query-doc'[OCRerr]ent sear&rling) a total 0£ N
compari3on operations are required for each input query, where N is
the number of documents in the reference collection. Let the number
0£ comparisons required per input query with a classification induced
storage organization be N . The relative search efficiency with
C
classi£ication may then be defined as the ratio N/N[OCRerr]. Further, assume
that the document set retrieved with a lImited search R is a subset
C
0£ the retrieved set R produced by a £ull search. This is a natural
consequence 0£ assuming that the retrieval criterion applied to the
query-document comparisons is the same £or both systems. Thus some
documents which would satisfy the retrieval criterion in a full search
may not be examined in the' reduced search mode and therefore cannot be
members of R . In general, then, there is a cost associated with a
c
limited search in terms 0£ documents which would be retrieved by a
£ull search but which are not retrieved by the reduced search (members
0£ the set R-R0). More precisely, [OCRerr]£ DR is the document set relevant
to the input query, the cost 0£ an increase in search efficiency is a
function 0£ the size 0£ the set DR C\' (R-R0), i.e. the number 0£
relevant documents lost.
Statistically, then, a retrieval system with a classification
induced storage organization may be characterized by an expected
(
R-R is definedas [OCRerr]d.E R
[OCRerr] R[OCRerr]
c