ISR10 Scientific Report No. ISR-10 Information Storage and Retrieval The Query-Document Matching Function chapter Joseph John Rocchio Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government. 4-.' ([OCRerr] :[OCRerr]o v..::.aliz[OCRerr] tKe e::[OCRerr]ec[OCRerr][OCRerr] [OCRerr]i tor'[OCRerr].[OCRerr]-; or.[OCRerr][OCRerr]za[OCRerr]io[OCRerr] on [OCRerr] docu::ient retri[OCRerr]ial oDeralon so::[OCRerr]e [OCRerr]e:i[OCRerr][OCRerr]t[OCRerr]onz are recuire[OCRerr]. In the base r[OCRerr]trieval 3ystem (£ull query-doc'[OCRerr]ent sear&rling) a total 0£ N compari3on operations are required for each input query, where N is the number of documents in the reference collection. Let the number 0£ comparisons required per input query with a classification induced storage organization be N . The relative search efficiency with C classi£ication may then be defined as the ratio N/N[OCRerr]. Further, assume that the document set retrieved with a lImited search R is a subset C 0£ the retrieved set R produced by a £ull search. This is a natural consequence 0£ assuming that the retrieval criterion applied to the query-document comparisons is the same £or both systems. Thus some documents which would satisfy the retrieval criterion in a full search may not be examined in the' reduced search mode and therefore cannot be members of R . In general, then, there is a cost associated with a c limited search in terms 0£ documents which would be retrieved by a £ull search but which are not retrieved by the reduced search (members 0£ the set R-R0). More precisely, [OCRerr]£ DR is the document set relevant to the input query, the cost 0£ an increase in search efficiency is a function 0£ the size 0£ the set DR C\' (R-R0), i.e. the number 0£ relevant documents lost. Statistically, then, a retrieval system with a classification induced storage organization may be characterized by an expected ( R-R is definedas [OCRerr]d.E R [OCRerr] R[OCRerr] c