Scientific Report No. ISR-11, Information Storage and Retrieval. Design Criteria for Automatic Information Systems. M. E. Lesk, G. Salton. Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

to identify documents and search requests, rather than individual concepts alone. Thus if a given document contains the notion of "program" and the notion of "language", it might be tagged with the phrase "programming language". Phrases can be generated using a variety of strategies: for example, a phrase can be assigned any time the specified components co-occur in a given document, or in a given sentence of a document; alternatively, more restrictive phrase generation methods can be used by incorporating into the phrase generation process a syntactic recognition routine to check the syntactic compatibility between the phrase components before a phrase is actually accepted. [i[OCRerr]]

In the SMART system, the normal phrase process uses a preconstructed dictionary of important phrases, and simple co-occurrence of phrase components, rather than syntactic criteria, is used to assign phrases to documents.* Phrases seem to be particularly useful as a means of incorporating into a document representation terms whose individual components are not always meaningful by themselves. For example, "computer" and "control" are reasonably nonspecific, while "computer control" has a much more definite meaning in a computer science collection.

The output of Fig. 9 shows that phrases tend to improve recall at some expense in initial precision. This same effect was previously noted when the abstract processing was compared with full text in Fig.
5(b); it results from the fact that the simple process is good enough to retrieve the first few relevant documents (that is, in the high precision region), while the more sophisticated procedure is important if additional relevant documents are also wanted (that is, for high recall).

* Syntactic methods have, however, been used experimentally, and sample results are published elsewhere. [6]
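
The dictionary-based phrase assignment described above can be illustrated with a minimal sketch. It is not the SMART implementation: the dictionary entries, the function names (`tokenize`, `split_sentences`, `assign_phrases`), and the `scope` parameter are all hypothetical, and the sketch assumes exact word matching with no stemming or thesaurus lookup. It only shows the difference between requiring the phrase components to co-occur anywhere in a document and requiring them to co-occur within a single sentence.

```python
import re

# Hypothetical phrase dictionary: phrase -> component concepts.
# The two entries are taken from the examples in the text; a real
# dictionary would be constructed in advance for the collection.
PHRASE_DICTIONARY = {
    "programming language": ("program", "language"),
    "computer control": ("computer", "control"),
}


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def assign_phrases(text, dictionary=PHRASE_DICTIONARY, scope="document"):
    """Assign every dictionary phrase whose components all co-occur.

    scope="document": components may appear anywhere in the text.
    scope="sentence": components must appear in the same sentence.
    No syntactic check is made, only simple co-occurrence.
    """
    if scope == "document":
        units = [tokenize(text)]
    elif scope == "sentence":
        units = [tokenize(s) for s in split_sentences(text)]
    else:
        raise ValueError("scope must be 'document' or 'sentence'")

    assigned = set()
    for phrase, components in dictionary.items():
        if any(all(c in unit for c in components) for unit in units):
            assigned.add(phrase)
    return assigned


doc = "The program was compiled. The language is formal."
print(assign_phrases(doc))                    # {'programming language'}
print(assign_phrases(doc, scope="sentence"))  # set()
```

In the example, the document-level rule assigns "programming language" even though the two components occur in unrelated sentences, while the sentence-level rule rejects it. A syntactic compatibility check would be a further restriction applied on top of either rule.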