An Information Retrieval Test-bed on the CM-5

B. Masand and C. Stanfill

We believe that this architecture suffices to implement boolean retrieval methods, standard IR methods, and novel methods making use of document structure, and that it can be scaled to very large databases and to massively parallel computers.

C. The Test-bed

The above architecture has been implemented in test-bed form on the Connection Machine model CM-5. The purpose of this implementation is to permit various retrieval methods to be evaluated, rather than to support on-line services on very large files. As such, the focus was on simplicity of implementation, speed of database construction, and speed of query execution, rather than on handling files larger than a few gigabytes.

The CM-5 [12] is a general-purpose parallel computer constructed from commodity microprocessors. Each processing node consists of a SPARC microprocessor, 32 megabytes of RAM, and a network interface. The CM-5 has two user-accessible networks: a Data Router, which is used for point-to-point transmission of data packets, and a Control Network, which is used to implement global operations such as barrier synchronization, broadcast, and global-maximum. The Data Router provides each processor with 5 MB/second (full duplex) of point-to-point bandwidth. The Control Network is constructed using a fan-in/fan-out tree, so that global operations complete within a few microseconds of their initiation.

The CM-5 I/O system consists of a high-performance mass storage system called the scalable disk array (SDA). In this system, disk controllers are connected directly to the CM-5's Data Router. This allows all processors to have equal access to all disks in the system, providing the image of a scalable shared-disk environment. The file system implements a UNIX file system on top of this hardware, such that file systems may be striped across up to 256 disks [13]. The result is a file system which can sustain transfer rates exceeding 150 MB/second in large configurations.

The basic approach taken in this implementation is to begin by partitioning the document set among the processors. This is done by having each processor read a fixed-size contiguous chunk (1 MB) of data from the input file. In general, this will result in some documents spanning processor boundaries, so document fragments are then re-distributed. Once this is done, each processor parses and lexes its own documents, using conventional programming techniques. The postings are then inserted into the inverted file.

One novel implementation trick is used in this phase of processing: rather than sorting the postings, which would be very time-consuming owing to the size of the uncompressed postings, the database is indexed in two passes. On the first pass, called the "dry run", the postings are generated, counted, and discarded. At the end of this pass, the system knows how much space is required to store the posting list for each word. Space is then allocated and carved up. The second, "production" phase then begins. The database is scanned, lexed, and indexed again from scratch, but this time the postings are compressed and stored into the space allocated at the end of the dry run. This strategy doubles indexing time but, by eliminating the expense of sorting the posting file, it ends up both simplifying the software and reducing overall database construction time. (A sketch of this scheme appears at the end of this section.)

At the end of this phase, the data structures noted above (posting file, lexicon/index, and document information) have been constructed in memory, and the database is ready for querying. Using these methods on a 64-processor CM-5, the TREC-2 training database (2.2 gigabytes) required 20 minutes to index. The speed of the indexing software permits the database to be re-indexed for every retrieval experiment, allowing both indexing methods and query methods to be conveniently explored. Ignoring stop words, the size of the compressed inverted-file index for the TREC database is 24% of the raw text. Details of the compression algorithm can be found in [11].

The first step in query evaluation is to broadcast the query to all processors. In a boolean system, the query would then be processed locally by each processor, and the results produced by the processors concatenated to return the final answer.

In an information retrieval system based on term weighting and document ranking, slightly more work is required. First, query-term weights are generally based on some sort of term-frequency observations. These cannot be computed locally, but require the use of some simple global operations. For example, to determine the number of documents in which a word occurs, each processor would count document occurrences for itself, then the global-sum operation (supported in hardware by the Control Network) would be used to produce a machine-wide count. The second problem arises after the documents have been scored: an algorithm is required to extract the highest-ranking documents in the collection. Several parallel algorithms for this task have been described in [14]. The test-bed used a variant of the iterative extraction algorithm: each processor locally determines its highest-ranking documents, then repeated applications of the global-maximum operation are used to find the best documents in the collection.
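To make this path concrete, the following is a minimal single-process sketch in Python. The Shard class, the global_sum helper, and the tf-idf weighting are illustrative stand-ins of our own, not the test-bed's API: on the CM-5 each shard would live on its own processor, and the sum and the per-round maximum would each be a single Control Network reduction.

    import math

    class Shard:
        """One processor's slice of the collection: doc_id -> {term: tf}."""
        def __init__(self, docs):
            self.docs = docs

        def local_df(self, term):
            # Local document-frequency count for one term.
            return sum(1 for tf in self.docs.values() if term in tf)

        def score(self, query_terms, idf):
            # Toy tf-idf scoring; the test-bed supports other weightings.
            return {d: sum(tf.get(t, 0) * idf[t] for t in query_terms)
                    for d, tf in self.docs.items()}

    def global_sum(values):
        # Stands in for the Control Network's hardware global-sum.
        return sum(values)

    def rank(shards, query_terms, n_docs, k):
        # Broadcast the query (implicit here), then turn local document
        # frequencies into a machine-wide count via global-sum.
        idf = {t: math.log(n_docs / max(1, global_sum(s.local_df(t) for s in shards)))
               for t in query_terms}
        # Score every document locally (in parallel on the real machine).
        local = [s.score(query_terms, idf) for s in shards]
        # Iterative extraction: each round, every shard offers its best
        # remaining document and a global-maximum picks the winner.
        best = []
        for _ in range(k):
            offers = [max(sc.items(), key=lambda kv: kv[1]) for sc in local if sc]
            if not offers:
                break
            doc, s = max(offers, key=lambda kv: kv[1])
            best.append((doc, s))
            for sc in local:
                sc.pop(doc, None)  # the winner leaves every shard's pool
        return best

Each iteration of the extraction loop corresponds to one global-maximum operation, so retrieving the top k documents costs k reductions on top of purely local scoring work.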
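In the same style, here is a sketch of the two-pass indexing scheme described earlier. It assumes a toy whitespace lexer and stores uncompressed (document, position) pairs, whereas the real system compresses postings into the space sized by the dry run; the names are again illustrative only.

    def lex(documents):
        """Yield (term, doc_id, position) postings; documents is a list of
        (doc_id, text) pairs so that it can be scanned twice."""
        for doc_id, text in documents:
            for pos, term in enumerate(text.split()):
                yield term, doc_id, pos

    def build_index(documents):
        # Pass 1, the "dry run": generate, count, and discard postings so
        # the exact size of every posting list is known before any is stored.
        counts = {}
        for term, _, _ in lex(documents):
            counts[term] = counts.get(term, 0) + 1
        # Allocate and carve up the space: one preallocated list per term.
        index = {term: [None] * n for term, n in counts.items()}
        fill = {term: 0 for term in counts}
        # Pass 2, the "production" phase: scan and lex again from scratch,
        # writing each posting straight into its reserved slot, never sorting.
        for term, doc_id, pos in lex(documents):
            index[term][fill[term]] = (doc_id, pos)
            fill[term] += 1
        return index

For example, build_index([('d1', 'to be or not to be')]) reserves exactly two slots for 'to' during the dry run, then fills them in scan order during the production phase; re-scanning the input is the price paid for never sorting the posting file.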
III. THE COSINE SIMILARITY MEASURE

Results of using the classical cosine similarity measure on the TREC collection have already been reported elsewhere [15], so those results have not been replicated here. This section will briefly describe how the cosine measure may be implemented within our architecture.

Cosine similarity measures generally involve constructing an inverted file incorporating document-term weights. The structured inverted-file architecture does not provide this information, partly because it is inconsistent with the structured representation, and partly because of the difficulties noted above with regard to distributed databases. Except for