SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
An Information Retrieval Test-bed on the CM-5
chapter
B. Masand
C. Stanfill
National Institute of Standards and Technology
D. K. Harman
An Information Retrieval [OCRerr][OCRerr]st-bed on the CM-5
Brij Masand and Craig Stanfill
Thinking Machines Corporation
245 First Street
Cambridge MA 02142
L INThODUCTION
For many years, research on information retrieval was
mostly conilned to a few relatively small test collections such
as the Cranfield collection [1], the NPL Collection [2], and the
CACM Collection [31. Over the years, results on those collec-
tions accumulated, with the aim of determilting which tech-
nique or combination of techniques resulted in the best
precision/recall figures on those collections. Gradually, a
"standard model" more-or-less emerged: for the test collec-
tions under study, consistently good results are obtained by
vector-model retrteval using a cosine similarity meastue, [OCRerr]idf
weighting, and a stemming algorithm (e.g. Chapter 9 of [4],
[5])[OCRerr]1 Out-performing this model on the old test collections
has proved extremely difficult. This has led to a danger of
stagnatioii in the field of IR, and a feeling that the majority of
what can be learned from precision-recall experiments on the
old collections has been learned.
Fortunately for the field, a number of recent developments
have led to new challenges for the field. Among these:
* The [OCRerr]pster project [6] has led to the construction of a test
collection of unprecedented size [7]. This leads to chal-
lenges related to scaling up IR methods.
IR methods are now being used on full documents (e.g.
news stories), rather than on abstracts. This leads to new
challenges relating to the structure of large documents,
particularly effects relating to term proximity.
* The on-line database vendors have shown interest in full-
text IR. This leads to challenges relating to the integration
of IR methods with traditionsl boolean methods which, in
the operational environment, must continue to be sup-
pcted if only for the benefit of the existing, entrenched
user community.
1. Other approaches yield results which are nearly as good and. in
rare cases better. Approaches based on Bayesian inference networks
are particularly notable in that respect. However, the above model
has come to be accepted as a de-facto standard against which new
methods are measured.
*Connection Machine is a registered trademark of Thinking Machines Corpo-
ration. CM, CM-2, CM-5. and Datavault are trademarks of ThInking Machines
Corporation.
SPARC is a trademark of Sun Microsystems.
117
* Network-oriented IR protocols have become increasingly
popular as a basic tool for navigating across the network.
This leads to the challenge of designing Systems which
give uniformly interpretable results across a distributed
database.
The work reported in this paper represents some steps along
the way to solving these problems. In essence, the challenge
we are facing is to design a system which delivers high preci-
sion-recall figures on large databases of long, complex docu-
ments. Furthermore, the system must be compatible with
existing boolean retrieval methods, and it must be possible to
use this system in a distributed database in a manner which
yields consistently interpretable results. The specific steps
reported at this point consist of the following:
* A database architecture which is suitable for use on large
collections of structured documents, which supports both
base-line IR and boolean retrieval methods.
* An in-memory implementation of this architecture on a
massively parallel computer (the Connection Machine
model CM-S), which is used as test-bed.
* Precision-Recall figures for this test-bed applied to the
[OCRerr]pster collection, as part of ThEC.
It must be emphasized that this work is in an early stage
and, at this point, has not reached the point where these meth-
ods are demonstrable on extremely large databases. Further-
more, work relating to taking advantage of document structure
has only barely begun.
II. DATABASE ARCHHECTURE
A. Choice of a Representation
There are three basic approaches to representing databases
for retrieval applications:
* Text Files. In this method, the database is stored in full-
text form, and scanned sequentially. Such methods are
often implemented by special-purpose hardware [8].
* Signature Files. In this family of methods, a compressed
surrogate for the text file is searched instead of the text file
itself. Overlap encoding is usually employed to construct
the surrogate. There are many variants on signature ifies
[9].
* Inverted Files. Inthis family of methods, an index con-
taining the locations of every word in the database is con-
structed. The search is then accomplished by reference to