NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
The Efficiency Issues Workshop Report
report of discussion group
James P. Callan
David D. Lewis
National Institute of Standards and Technology
D. K. Harman
2.1 Will Existing Methods Scale?
Recent trends in information retrieval are towards gi-
gabyte and terabyte document collections, retrieval by
subsections/paragraphs, and multiple representations
of document content. The first focus questions were
whether existing storage and retrieval methods can
cope with this explosion of information, and if new
methods are needed, what might they be?
Participants identified two approaches as currently
dominating IR:
1. Search all available subcollections (e.g. the TIP-
STER/TREC task), or
2. have a user specify which subcollection to search
(e.g. commercial systems).
Both approaches require modification if they are to
scale up.
One problem with the "search everything" approach
is that the growth rate of online information was felt
to exceed the growth rate in computer performance.
Even if a collection is distributed across multiple pro-
cessors, detailed consideration of every document may
be too expensive when an IR system faces a terabyte of
data. One solution is to do a fast first-pass retrieval to
produce a reduced set of documents for more detailed
consideration. This first pass might involve generat-
ing approximate scores for each document (e.g. ETH
in TREC-2), scoring documents based on a smaller
amount of text (e.g. abstracts or introductions), or
scoring cluster centroids rather than individual doc-
uments.
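The two-pass idea above can be illustrated with a minimal sketch. It assumes that each document comes with a short abstract and a full text, and it uses simple query-term counting as a stand-in for both the fast and the detailed scoring functions. The names two_pass_search, overlap_score, and shortlist_size are hypothetical and are not taken from any TREC-2 system.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def overlap_score(query_terms, text):
    """Count how many times any query term occurs in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in query_terms)


def two_pass_search(query, docs, shortlist_size=2, top_k=1):
    """docs maps doc_id -> (abstract, full_text); both fields are assumptions."""
    query_terms = set(tokenize(query))
    # First pass: cheap score computed from the abstract only.
    shortlist = sorted(
        docs,
        key=lambda doc_id: overlap_score(query_terms, docs[doc_id][0]),
        reverse=True,
    )[:shortlist_size]
    # Second pass: more expensive score over the full text,
    # computed only for the documents that survived the first pass.
    ranked = sorted(
        shortlist,
        key=lambda doc_id: overlap_score(query_terms, docs[doc_id][1]),
        reverse=True,
    )
    return ranked[:top_k]


docs = {
    "d1": ("parallel retrieval systems",
           "parallel retrieval systems scale retrieval to large "
           "collections with parallel hardware"),
    "d2": ("cooking with herbs", "herbs and spices for cooking"),
    "d3": ("retrieval evaluation",
           "evaluation of retrieval uses relevance judgments"),
}
best = two_pass_search("parallel retrieval", docs, shortlist_size=2, top_k=1)
```

The cost of the detailed pass depends on shortlist_size rather than on the size of the collection, which is the property the participants were after. The price is that a relevant document dropped in the first pass can never be recovered.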
One problem with the "user chooses subcollections"
approach is that the task will be overwhelming when
there are many subcollections. A significant portion
of users of one commercial service already choose to
search everything rather than select subcollections.
The system will have to provide assistance if user se-
lection is to be viable. If subcollections can be charac-
terized succinctly, perhaps by centroid vectors, auto-
matically generated thesauri, or controlled vocabulary
terms (assigned manually or automatically), then one
could use the query to rank subcollections, and then
search only the top-ranked subcollections. Other
approaches include assistance by an expert system, or
browsing interfaces for hyperlinked subcollections.
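Ranking subcollections by centroid vectors might look like the following sketch. It assumes raw term-frequency vectors, cosine similarity, and non-empty subcollections. These are illustrative choices, and the function names are hypothetical. The report does not specify a representation or a similarity measure.

```python
import math
from collections import Counter


def vectorize(text):
    """Raw term-frequency vector for a piece of text."""
    return Counter(text.lower().split())


def centroid(docs):
    """Average term-frequency vector of a non-empty list of documents."""
    total = Counter()
    for doc in docs:
        total.update(vectorize(doc))
    return {term: count / len(docs) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_subcollections(query, subcollections, top_n=1):
    """Return the names of the top_n subcollections closest to the query."""
    query_vec = vectorize(query)
    centroids = {name: centroid(docs) for name, docs in subcollections.items()}
    ranked = sorted(
        centroids,
        key=lambda name: cosine(query_vec, centroids[name]),
        reverse=True,
    )
    return ranked[:top_n]


subcollections = {
    "news": ["election results announced", "markets react to election"],
    "medicine": ["clinical trial results", "new drug trial approved"],
}
chosen = rank_subcollections("drug trial", subcollections, top_n=1)
```

In practice the centroids would be computed once at indexing time and stored, so that selecting subcollections costs one similarity computation per subcollection rather than one per document.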
Global statistics that summarize some aspect of a
collection (e.g. idf) were expected to be a problem for
searching multiple subcollections and distributed doc-
ument collections. If a collection is formed at indexing
time, statistics can be gathered and saved when indices
are built. If a single processor performs retrieval,
statistics can be gathered during retrieval. However,
if multiple subcollections or processors are involved, it
is less clear how to compute global statistics. Methods
that rely only on local statistics that summarize a
document (e.g. Berkeley in TREC-2) offer a computa-
tional advantage in this environment.
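To make the difficulty with global statistics concrete, the sketch below shows one possible way to obtain a collection-wide idf when documents are spread over several subcollections: each subcollection reports its document count and per-term document frequencies, and the reports are summed before idf is computed. This is an assumption made for illustration, not a method proposed at the workshop. It presumes that the subcollections are disjoint, it uses the log(N/df) form of idf as one common variant, and all function names are hypothetical.

```python
import math
from collections import Counter


def local_stats(docs):
    """Return (document count, document frequency per term) for one subcollection."""
    df = Counter()
    for doc in docs:
        # Use a set so each term is counted once per document.
        df.update(set(doc.lower().split()))
    return len(docs), df


def merge_stats(stats):
    """Sum the counts reported by each subcollection (assumed disjoint)."""
    total_docs = 0
    total_df = Counter()
    for doc_count, df in stats:
        total_docs += doc_count
        total_df.update(df)
    return total_docs, total_df


def idf(term, total_docs, total_df):
    """log(N / df); returns 0.0 for terms that occur in no document."""
    df = total_df.get(term, 0)
    return math.log(total_docs / df) if df else 0.0


shard_a = ["text retrieval", "text storage"]
shard_b = ["image retrieval", "text compression"]
total_docs, total_df = merge_stats([local_stats(shard_a), local_stats(shard_b)])
retrieval_idf = idf("retrieval", total_docs, total_df)
```

The merge step is where the extra cost arises: every subcollection must be contacted, and the merged counts go stale whenever any subcollection changes. A method that uses only per-document statistics avoids this step entirely.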
2.2 Specialized Hardware and Software
A related question posed by increasingly large and dis-
tributed text databases is whether existing hardware
and software platforms will be up to the challenge. The
consensus among participants was that conventional
architectures will suffice, because IR is a data-parallel
task that lends itself to distributed computation.
Participants also felt that they had generally ignored
issues of efficiency, and could increase their speeds if
necessary. There was little support for supercomputers,
massively parallel computers, or specialized architectures.
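The data-parallel character of retrieval can be sketched as follows: the collection is partitioned into shards, each shard is scored independently, and only the small per-shard result lists are merged. The sketch uses threads from the Python standard library as a stand-in for separate processors and query-term counting as a stand-in for a real scoring function. The shard layout and function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(query_terms, shard, k):
    """Score every document in one shard and return its k best (score, doc_id) pairs."""
    scored = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def parallel_search(query, shards, k=2):
    """Search all shards concurrently, then merge the partial rankings."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(terms, shard, k), shards))
    # Each shard returns at most k hits, so the merge handles
    # at most k * len(shards) candidates regardless of collection size.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


shards = [
    {"a1": "ranked retrieval on mainframes", "a2": "weather report"},
    {"b1": "ranked retrieval of ranked lists", "b2": "retrieval benchmarks"},
]
hits = parallel_search("ranked retrieval", shards, k=2)
```

No shard needs to see another shard's documents, which is why the task maps onto conventional machines connected by an ordinary network. The exception is any score that depends on collection-wide statistics, which would have to be shared with the shards beforehand.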
One could argue that the participants were biased
towards conventional hardware and software by their
own need for flexibility, their small budgets, their insu-
lation from the time constraints of real users, and their
desire not to think about "systems" issues not directly
relevant to their research. However, the recent fielding
of ranked retrieval systems using conventional main-
frames by some of the largest online vendors provides
additional support for their view that conventional
architectures will suffice.