NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
An Information Retrieval Test-bed on the CM-5
chapter
B. Masand
C. Stanfill
National Institute of Standards and Technology
D. K. Harman
We believe that this architecture suffices to implement Boolean retrieval methods, standard IR methods, and novel methods making use of document structure, and that it can be scaled to very large databases and to massively parallel computers.
C. The Test-bed
The above architecture has been implemented in test-bed form on the Connection Machine model CM-5. The purpose of this implementation is to permit various retrieval methods to be evaluated, rather than to support on-line services on very large files. As such, the focus was on simplicity of implementation, speed of database construction, and speed of query execution, rather than on handling files larger than a few gigabytes.
The CM-5 [12] is a general-purpose parallel computer constructed from commodity microprocessors. Each processing node consists of a SPARC microprocessor, 32 megabytes of RAM, and a network interface. The CM-5 has two user-accessible networks: a Data Router, which is used for point-to-point transmission of data packets, and a Control Network, which is used to implement global operations such as barrier synchronization, broadcast, and global-maximum. The Data Router provides each processor with 5 MB/second (full duplex) of point-to-point bandwidth. The Control Network is constructed using a fan-in/fan-out tree, so that global operations complete within a few microseconds of their initiation.
The CM-5 I/O system consists of a high-performance mass storage system called the scalable disk array (SDA). In this system, disk controllers are connected directly to the CM-5's Data Router. This allows all processors to have equal access to all disks in the system, providing the image of a scalable shared-disk environment. The file system implements a UNIX file system on top of this hardware, such that file systems may be striped across up to 256 disks [13]. The result is a file system which can sustain transfer rates exceeding 150 MB/second in large configurations.
The basic approach taken in this implementation is to begin by partitioning the document set among the processors. This is done by having each processor read a fixed-size contiguous chunk (1 MB) of data from the input file. In general, this will result in some documents spanning processor boundaries, so document fragments will then be re-distributed. Once this is done, each processor parses and lexes its own documents, using conventional programming techniques. The postings are then inserted into the inverted file.
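The partition-then-redistribute step can be sketched in a simplified, single-process form. The chunk size, the newline document delimiter, and all helper names below are illustrative assumptions, not details from the paper:

```python
# Sketch: give each simulated processor a fixed-size contiguous chunk,
# then re-unite documents that span chunk boundaries by shipping each
# chunk's leading fragment back to the previous processor.
# Chunk size and the document delimiter are illustrative assumptions.

CHUNK_SIZE = 16          # stand-in for the 1 MB chunks in the paper
DOC_SEP = "\n"           # assume newline-delimited documents

def partition(data, n_procs):
    """Each simulated processor reads one fixed-size contiguous chunk."""
    return [data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(n_procs)]

def redistribute(chunks):
    """Move fragments so every document lives whole on one processor."""
    fixed = list(chunks)
    for i in range(len(fixed) - 1, 0, -1):
        head, sep, rest = fixed[i].partition(DOC_SEP)
        if sep:          # leading fragment completes the previous doc
            fixed[i - 1] += head + sep
            fixed[i] = rest
        else:            # the whole chunk is a single fragment
            fixed[i - 1] += fixed[i]
            fixed[i] = ""
    return fixed

data = "doc-one text\ndoc-two longer text\ndoc-three\n"
chunks = partition(data, 3)
whole = redistribute(chunks)
assert "".join(whole) == data                        # nothing lost
assert all(c == "" or c.endswith(DOC_SEP) for c in whole)  # no split docs
```

After redistribution, each non-empty chunk ends on a document boundary, so every processor can parse and lex its documents independently, as the text describes.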
One novel implementation trick is used in this phase of processing: rather than sorting the postings, which would be very time consuming owing to the size of the uncompressed postings, the database is indexed in two passes. On the first pass, called the "dry run," the postings are generated, counted, and discarded. At the end of this pass, the system knows how much space is required to store the posting list for each word. Space is then allocated and carved up. The second "production" phase then begins. The database is scanned, lexed, and indexed again from scratch, but this time the postings are compressed and stored into the space allocated at the end of the dry run. This strategy doubles indexing time but, by eliminating the expense of sorting the posting file, it ends up both simplifying the software and reducing overall database construction time.
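A minimal serial sketch of the two-pass scheme follows. The whitespace tokenizer and the flat postings array are simplifying assumptions (the real system also compresses the postings, per [11]):

```python
# Two-pass indexing sketch: the "dry run" counts postings per term so
# that exact space can be pre-allocated and carved up; the "production"
# pass re-parses everything and writes postings into their slots.
# Tokenization and data layout here are simplified assumptions.

docs = ["the quick fox", "the lazy dog", "quick quick dog"]

def postings_of(doc_id, text):
    for pos, term in enumerate(text.split()):
        yield term, (doc_id, pos)

# Pass 1 ("dry run"): generate, count, and discard postings.
counts = {}
for d, text in enumerate(docs):
    for term, _ in postings_of(d, text):
        counts[term] = counts.get(term, 0) + 1

# Allocate exactly the space each posting list needs, then carve it up.
offsets, total = {}, 0
for term, n in counts.items():
    offsets[term] = total
    total += n
store = [None] * total          # one flat, pre-sized postings array
cursor = dict(offsets)          # next free slot for each term

# Pass 2 ("production"): re-scan and write postings into their slots.
for d, text in enumerate(docs):
    for term, posting in postings_of(d, text):
        store[cursor[term]] = posting
        cursor[term] += 1

quick = store[offsets["quick"]:offsets["quick"] + counts["quick"]]
assert quick == [(0, 1), (2, 0), (2, 1)]
```

Because each term's slots are pre-sized, postings land contiguously and already grouped by term, which is exactly the order a sort would have produced.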
At the end of this phase, the data structures noted above (posting file, lexicon/index, and document information) have been constructed in-memory, and the database is ready for querying. Using these methods on a 64-processor CM-5, the TREC-2 training database (2.2 gigabytes) required 20 minutes to index. The speed of the indexing software permits the database to be re-indexed for every retrieval experiment, allowing both indexing methods and query methods to be conveniently explored. Ignoring stop words, the size of the compressed inverted file index for the TREC database is 24% of the raw text. Details of the compression algorithm can be found in [11].
The first step in query evaluation is to broadcast the query to all processors. In a Boolean system, the query would then be processed locally by each processor, and the results produced by the processors concatenated to return the final answer.
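The broadcast-evaluate-concatenate pattern can be sketched as below; the partition contents and the AND-only query form are illustrative assumptions:

```python
# Sketch of distributed Boolean evaluation: the query is "broadcast"
# to every simulated processor, each evaluates it against its local
# document partition, and the per-processor hit lists are concatenated.
# The partitioning and query form are illustrative assumptions.

partitions = [
    {0: {"parallel", "index"}, 1: {"cosine", "weight"}},   # processor 0
    {2: {"parallel", "cosine"}, 3: {"index", "router"}},   # processor 1
]

def local_eval(local_docs, must_have):
    """AND-query over one processor's local documents."""
    return [doc_id for doc_id, terms in sorted(local_docs.items())
            if must_have <= terms]

query = {"parallel"}                 # broadcast to all processors
answer = []
for local_docs in partitions:        # each processor works independently
    answer.extend(local_eval(local_docs, query))

assert answer == [0, 2]
```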
In an information retrieval system based on term weighting and document ranking, slightly more work is required. First, query-term weights are generally based on some sort of term-frequency observations. These computations cannot be done locally, but require the use of some simple global operations. For example, to determine the number of documents in which a word occurs, each processor would count document occurrences for itself, then the global-sum operation (supported in hardware by the Control Network) would be used to produce a machine-wide count. The second problem arises after the documents have been scored: an algorithm is required to extract the highest-ranking documents in the collection. Several parallel algorithms for this task have been described in [14]. The test-bed used a variant of the iterative extraction algorithm: each processor locally determines its highest-ranking documents, then repeated applications of the global-maximum operation are used to find the best documents in the collection.
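The iterative extraction step can be sketched serially: each simulated processor keeps a local heap of its scored documents, and a repeated "global maximum" (here just a max over the processors' current local bests) pulls out the overall best documents one at a time. The scores and partition sizes are made-up examples:

```python
# Iterative-extraction top-k sketch: repeated global maximum over the
# processors' local bests. Scores and partitions are made-up examples.

import heapq

# Per-processor (doc_id, score) lists after document scoring.
per_proc = [
    [(0, 0.9), (1, 0.2)],
    [(2, 0.7), (3, 0.8)],
    [(4, 0.4)],
]

def top_k(per_proc, k):
    # Each processor's local max-heap, stored as (-score, doc_id).
    heaps = [[(-s, d) for d, s in docs] for docs in per_proc]
    for h in heaps:
        heapq.heapify(h)
    best = []
    for _ in range(k):
        # "Global maximum" across processors' current local bests.
        live = [(h[0], i) for i, h in enumerate(heaps) if h]
        if not live:
            break
        (neg_score, doc_id), winner = min(live)
        heapq.heappop(heaps[winner])     # only the winner loses its best
        best.append((doc_id, -neg_score))
    return best

assert top_k(per_proc, 3) == [(0, 0.9), (3, 0.8), (2, 0.7)]
```

Each round costs one global-maximum operation, which the text notes completes within microseconds on the Control Network, so extracting the top few hundred documents stays cheap.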
III. THE COSINE SIMILARITY MEASURE
Results of using the classical cosine similarity measure on the TREC collection have already been reported elsewhere [15], so those results have not been replicated. This section will briefly describe how the cosine measure may be implemented within our architecture.
Cosine similarity measures generally involve constructing an inverted file incorporating document-term weights. The structured inverted file architecture does not provide this information, partly because it is inconsistent with the structured representation, and partly because of the difficulties noted above with regard to distributed databases. Except for