SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
Memory requirements
In total, the peak memory requirement during query processing is about 10.9 Mb (not all
components are required to be co-resident, and, in particular, the accumulators and answer
buffer are never simultaneously present). If necessary, this could be further reduced by writing
the output text directly to disk rather than into an answer buffer. However, we do not feel that
this requirement is excessive given the large size of the document collection, and to date the
use of approximate weights is the only area where we have concentrated on reducing memory
usage.
3 Structured Queries
We focused in this experiment on structured queries. They would be transformed in two
ways: into Boolean queries that would return a given number of documents, and into a vector
of weights for the purposes of ranking. The system in which these experiments were carried
out was a nested relational database, Atlas, that uses the multi-organisation signature ifie
scheme [10]. The bulk of these experiments were carried out on the second part of the Tipster
database. At the end of this section, we wrn report on some of the experiments that we have
carried out on the whole database.
3.1 Boolean Query Generation
There are good reasons for preferring ranked retrieval over Boolean retrieval. Nevertheless,
there are two key reasons that Boolean retrieval might be used. The first is that Boolean
retrieval is computationally simpler, and the second is that the retrieval system may not support
ranked retrieval. Since both factors were relevant to the set of experiments described her[OCRerr]
we had been working with a Boolean retrieval system rather than the system described in
Section 2-it was desirable to form a Boolean query and rank the answers it returned.
There were several requirements that a Boolean query generation method needed to satisfy.
The first is that it should return a fixed number of documents. This allowed control over query
time by fixing the number of documents to be subsequently ranked. Secondly, the query should
be in disjunctive normal form, and have relatively few disjuncts. This is because query time is
almost linear in the number of disjuncts in signature file retrieval systems. Thirdly the query
generation mechanism should be automatic. These are the requirements that have previously
been used as criteria for Boolean query generation algorithms [12]. However, the queries that
we are considering in the TREC experiments have additional structure that may be used to
advantage. Despite the limited amount of text to be processed, no natural language processing
was performed; all processing was statistically and structurally based.
There were many possible approaches to Boolean query generation. A starting point
might be to use one of the Boolean query generation algorithms previously developed [12]
on the <narrative>. However, this would not adequately use the other available structures.
We considered three approaches: base the query on the <concepts>, base the query on
the <description>, augmented by the <narrative>, and base the query on the <topic>,
<description>, and <narrative>, using adjacent pairs in the disjuncts as well.
The first approach used a list of concepts. The words within the concept were ordered with
the most frequent word in the full text of the query first, and the least frequent last. Starting
at the first concept, a disjunct was formed by creating a conjunction of terms from the concept,
starting at the first word, until the conjunct returned fewer documents than the desired limit.
236