SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) An Information Retrieval Test-bed on the CM-5 chapter B. Masand C. Stanfill National Institute of Standards and Technology D. K. Harman An Information Retrieval [OCRerr][OCRerr]st-bed on the CM-5 Brij Masand and Craig Stanfill Thinking Machines Corporation 245 First Street Cambridge MA 02142 L INThODUCTION For many years, research on information retrieval was mostly conilned to a few relatively small test collections such as the Cranfield collection [1], the NPL Collection [2], and the CACM Collection [31. Over the years, results on those collec- tions accumulated, with the aim of determilting which tech- nique or combination of techniques resulted in the best precision/recall figures on those collections. Gradually, a "standard model" more-or-less emerged: for the test collec- tions under study, consistently good results are obtained by vector-model retrteval using a cosine similarity meastue, [OCRerr]idf weighting, and a stemming algorithm (e.g. Chapter 9 of [4], [5])[OCRerr]1 Out-performing this model on the old test collections has proved extremely difficult. This has led to a danger of stagnatioii in the field of IR, and a feeling that the majority of what can be learned from precision-recall experiments on the old collections has been learned. Fortunately for the field, a number of recent developments have led to new challenges for the field. Among these: * The [OCRerr]pster project [6] has led to the construction of a test collection of unprecedented size [7]. This leads to chal- lenges related to scaling up IR methods. IR methods are now being used on full documents (e.g. news stories), rather than on abstracts. This leads to new challenges relating to the structure of large documents, particularly effects relating to term proximity. * The on-line database vendors have shown interest in full- text IR. This leads to challenges relating to the integration of IR methods with traditionsl boolean methods which, in the operational environment, must continue to be sup- pcted if only for the benefit of the existing, entrenched user community. 1. Other approaches yield results which are nearly as good and. in rare cases better. Approaches based on Bayesian inference networks are particularly notable in that respect. However, the above model has come to be accepted as a de-facto standard against which new methods are measured. *Connection Machine is a registered trademark of Thinking Machines Corpo- ration. CM, CM-2, CM-5. and Datavault are trademarks of ThInking Machines Corporation. SPARC is a trademark of Sun Microsystems. 117 * Network-oriented IR protocols have become increasingly popular as a basic tool for navigating across the network. This leads to the challenge of designing Systems which give uniformly interpretable results across a distributed database. The work reported in this paper represents some steps along the way to solving these problems. In essence, the challenge we are facing is to design a system which delivers high preci- sion-recall figures on large databases of long, complex docu- ments. Furthermore, the system must be compatible with existing boolean retrieval methods, and it must be possible to use this system in a distributed database in a manner which yields consistently interpretable results. The specific steps reported at this point consist of the following: * A database architecture which is suitable for use on large collections of structured documents, which supports both base-line IR and boolean retrieval methods. * An in-memory implementation of this architecture on a massively parallel computer (the Connection Machine model CM-S), which is used as test-bed. * Precision-Recall figures for this test-bed applied to the [OCRerr]pster collection, as part of ThEC. It must be emphasized that this work is in an early stage and, at this point, has not reached the point where these meth- ods are demonstrable on extremely large databases. Further- more, work relating to taking advantage of document structure has only barely begun. II. DATABASE ARCHHECTURE A. Choice of a Representation There are three basic approaches to representing databases for retrieval applications: * Text Files. In this method, the database is stored in full- text form, and scanned sequentially. Such methods are often implemented by special-purpose hardware [8]. * Signature Files. In this family of methods, a compressed surrogate for the text file is searched instead of the text file itself. Overlap encoding is usually employed to construct the surrogate. There are many variants on signature ifies [9]. * Inverted Files. Inthis family of methods, an index con- taining the locations of every word in the database is con- structed. The search is then accomplished by reference to