NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-2 Document Retrieval Experiments using PIRCS
chapter
K. Kwok
L. Grunfeld
National Institute of Standards and Technology
D. K. Harman
approach we achieve several objectives for our system: a)
reduce the 'dead' time between a collection being acquired
and its availability for searching, since the full inverted file
is not produced; b) satisfy the 2N bytes of space
requirement; c) support fast feedback learning and query
expansion by having the direct file available. The price we
pay is that the network must be created dynamically, unlike the
full inverted file, which is produced once. However, since
the full inverted file is too large to reside in memory,
reading parts of it in for retrieval is also time consuming.
3.2 Reduced Network Size
We produce our network in memory according to the
queries under attention and the terms used in them. Since
memory is limited, documents are divided into
subcollections (Section 3.3) and queries are lined up five
to ten at a time. We define the active term set (ATS) of a
network as the set of terms used by the current queries and
their feedback documents, if any. The latter provide
terms for query expansion. Only edges that connect items
to the active term set are initiated in the network, reducing
network space requirements substantially. With the current
implementation, for a 1 GB subcollection and 7 queries, the
node and edge files together take about 40 MB. Using
clock time as a measure, producing the network requires
about 40 minutes. Learning is fast, about 2 minutes, and
retrieval and ranking take another 8 minutes. We hope that
further improvements in design and faster hardware in the
future can improve these figures substantially.
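To make the edge-pruning idea concrete, the following is a minimal sketch (Python, not PIRCS code) of restricting network construction to the active term set. The direct-file record layout and the function names used here are illustrative assumptions rather than the system's actual data structures.

    def active_term_set(queries, feedback_docs=()):
        # ATS = union of terms used by the current query batch and any
        # feedback documents supplying terms for query expansion.
        ats = set()
        for item in list(queries) + list(feedback_docs):
            ats.update(item)          # each item: an iterable of term ids
        return ats

    def build_edges(direct_file, ats):
        # direct_file: sequential records of (doc_id, {term: within-doc freq}).
        # Only edges that touch the ATS are created, so the in-memory
        # node and edge files stay small.
        edges = {}
        for doc_id, term_freqs in direct_file:
            for term, freq in term_freqs.items():
                if term in ats:
                    edges[(doc_id, term)] = freq
        return edges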
3.3 Master and Subcollection File Structure
We view the TREC experiments as document retrieval from
multiple collections, with retrieval results reported in one
single ranked list of documents for each topic (query).
Although only four or five collections such as Wall Street
Journal, Associated Press, etc. are given in TREC, in reality
there could be many more. We consider three methods of file
design to approach the problem:
a) A centralized approach where all documents from all
collections are processed as if from a single document
stream, producing a centralized dictionary containing full
usage statistics and a giant direct file. From these,
networks can be initialized. The idea is simple, but it has
drawbacks in that eventually the direct file would exceed
file/disk size limitations and software has to be designed to
handle data crossing file/disk boundaries. Moreover, it is
inherently fragile to create single files of this size. The
advantage of this approach is that RSVs calculated are
directly comparable for all documents and a single ranked
list is produced without difficulty.
b) An independent collections approach where each source
collection, of about 0.5-1.0 GB say, forms a textbase with its
own local dictionary and direct file for network initiation
and retrieval. One simply repeats the process for as many
collections as necessary. This is the preferred approach,
and if one has n processors and sufficient disk space, n
separate textbases can be created for learning and retrieval
in parallel, saving substantial time. The problem is how to
combine the retrieval lists from each into a single ranked
list, since each textbase has its own term usage statistics and
calculates RSVs for ranking within its own environment.
Classical Boolean retrieval and coordinate matching pose no
problem. Some retrieval strategies may produce RSVs that
are comparable across collections in theory, but after
approximations are taken it is questionable whether this still
holds. A similar problem exists for retrieval from distributed
databases such as the WAIS environment.
c) For TREC-2 we settle on a hybrid subcollections
approach, treating each source as a subcollection within a
master. We create a master centralized dictionary as in a),
capturing full usage statistics and serving all the subcollections,
and create separate direct files for each subcollection as in
b). The central dictionary has about 620,000 unique terms
after processing 2 GB from Disk 1 and Disk 2, and is
relatively small. It captures global term usage statistics,
while the individual direct files capture local usage statistics
within items. Separate networks are then created for each
subcollection with edge weights based on the correct global
and local statistics as in a), assuring that retrieval lists
contain RSVs that are directly comparable. This approach
combines the advantages of both a) and b), and can also
function in a parallel distributed environment.
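The following is a minimal sketch of why the hybrid design yields directly comparable RSVs: every subcollection scores documents with the same global statistics taken from the master dictionary, while term frequencies come from its own direct file. The icf-style weight shown is a generic placeholder, not the actual PIRCS edge weight.

    import math

    def global_weight(term, master_dict, collection_size):
        # Global statistic from the master centralized dictionary, identical
        # for every subcollection (placeholder inverse-collection-frequency).
        return math.log(collection_size / master_dict[term])

    def rsv(doc_terms, query_terms, master_dict, collection_size):
        # doc_terms: {term: local frequency} read from one subcollection's
        # direct file; only global weights enter the score, so ranked lists
        # from different subcollections merge directly by RSV.
        return sum(freq * global_weight(t, master_dict, collection_size)
                   for t, freq in doc_terms.items() if t in query_terms)

Because the weights depend only on the master dictionary, the per-subcollection ranked lists can simply be merged and re-sorted by RSV to produce the single list required for each topic.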
4. Item Representation
As in TREC-1, a number of preprocessing steps, mainly for the
purpose of improving the representations of documents and
queries, are done as follows:
4.1 Vocabulary Control
In addition to a manual stopword list of about 630 words
and another 528 manually identified 2-word phrases, we
also process samples from Disk 1 and Disk 2 using all five
source types (WSJ, AP, FR, ZIFF and DOE, of about
100 MB each) to produce two-word phrases based on
adjacency within sentence context. Our objective is to
remedy losses of recall and precision due to the removal of
high frequency terms. Our criterion for phrases is that each
word pair must have a frequency of 40, and either one or
both components must be high frequency (>=10000).
casual scan of the resultant list led us to remove