SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
The Efficiency Issues Workshop Report
report of discussion group
James P. Callan
David D. Lewis
National Institute of Standards and Technology
D. K. Harman
James P. Callan
Computer Science Department
University of Massachusetts
Box 34610
Amherst, MA 01003-4610, USA
(callan@cs.umass.edu)
Although most groups participating in TREC-2 emphasized
precision and recall, the conference was also
an appropriate forum in which to discuss the efficiency
of document indexing and retrieval. For some participants,
running out of disk space was their worst problem,
while for others running out of time was. Some
groups were unable to run on the entire collection, and
several remarked that they were happy to have gotten
anything running at all. No group reported finding
TREC trivial.
Two discussion groups were organized to address ef-
ficiency issues, one focusing on document indexing, the
other on document retrieval. Since efficiency has not
been emphasized by the research IR community, participants
in both groups felt that current algorithms
have a lot of room for improvement. TREC provides
one motivation for such improvements, as do ever-
growing real world databases. However, there was con-
cern in both groups that the TREC format, which en-
courages participation in both ad hoc (retrospective)
and routing (filtering) tasks, might discourage research
on efficient task-specific architectures.
The following sections provide more detail on each
of the two discussion groups.
1 Document Indexing
The raw text for the TREC collection (routing and
ad hoc) required approximately 3 GB of space. Index
structures required from 7.5[...] themselves sufficient to
recreate the original text, so they would be additional
overhead in an operational system. (A research system
might be able to discard the original text, reporting
just document ids for evaluation.)
Several groups, including CITRI, Thinking Ma-
chines, and UMass, stored inverted lists in compressed
form. There was general agreement that for sites will-
ing to invest the programming effort, substantial space
savings could be achieved in this fashion. (CITRI
demonstrated a factor of six reduction in index file
size.) There was more debate on the potential for index
compression speeding up query processing as well,
with some participants saying their query processing
was I/O bound, but others saying theirs was CPU
bound. The peak amount of space used during indexing
(for text, indices, and auxiliary files) varied from
[...] that this may be worth more attention.
David D. Lewis
AT&T Bell Laboratories
600 Mountain Avenue
Room 2C409
Murray Hill, NJ 07974, USA
(lewis@research.att.com)
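To make the idea of storing inverted lists in compressed form concrete, the following is a minimal sketch of one common approach: document ids are stored as gaps between successive ids, and each gap is written with a variable-length byte code. This is an illustration only. The report does not say which coding schemes the groups used, and the function names below are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte.

    The high bit is set on the final byte of each number, so small
    numbers (the common case for gaps) take a single byte.
    """
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80  # the low-order group is written last; flag it as the terminator
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:  # terminator byte reached
            numbers.append(n)
            n = 0
    return numbers


def compress_postings(doc_ids):
    """Compress a sorted list of document ids by coding the gaps between them."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the sorted document ids from the coded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical inverted list for one term.
postings = [3, 7, 8, 300, 100000]
packed = compress_postings(postings)
print(len(packed), "bytes compressed;", 4 * len(postings), "bytes as 32-bit integers")
```

Decoding costs CPU time but reduces the bytes read from disk, which is one way to see why I/O-bound and CPU-bound sites disagreed about whether compression would speed up query processing.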
Efficiency improvements will not come immediately,
and some may require significant expense in program-
ming time. Sharing of software between groups, which
increased from TREC-1 to TREC-2, helps limit this
expense. In addition, TREC research groups primar-
ily interested in, say, query analysis, may in the future
want to team up with groups that have addressed or
are addressing issues of scale. As the size of test collections
increases, it makes less sense to have them
replicated at dozens of sites, particularly when interactive
access across networks is usually easily available.
Time to build index structures was tolerable for
most participants, though it was mentioned that it
never seems to go as easily or automatically as one
might hope. It was a serious issue for groups doing
some form of natural language processing. Times of
2 to 4 MB/hr were mentioned by at least two of the
NL groups, making TREC a very daunting task. The
opinion was expressed by some participants that an
NL technique would have to provide as yet undemon-
strated improvements in effectiveness to be worth the
slowdown in indexing and query processing.
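A rough calculation shows why these rates were daunting. The sketch below assumes the full 3 GB of raw text is indexed in a single pass at a constant rate, and takes 1 GB as 1024 MB. Both are simplifying assumptions, not figures from the discussion.

```python
def indexing_hours(collection_mb, rate_mb_per_hour):
    """Hours needed to index a collection at a constant throughput."""
    return collection_mb / rate_mb_per_hour


COLLECTION_MB = 3 * 1024  # assumed: the roughly 3 GB of raw text, in binary megabytes

for rate in (2, 4):  # MB/hr rates mentioned by the NL groups
    hours = indexing_hours(COLLECTION_MB, rate)
    print(f"{rate} MB/hr: {hours:.0f} hours, about {hours / 24:.0f} days")
```

Under these assumptions a single pass over the collection takes roughly 32 to 64 days of continuous processing.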
Complex text representations, such as those produced
by NL, require additional information in the
index structures. The different ways of dealing with
this problem are most noticeable in the handling of
phrases in TREC, with some groups indexing on
phrases just as on words, others relying on word position
information stored in an inverted file, and still
others reparsing the raw text of a subset of documents
at retrieval time to find phrase occurrences.
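Of the three approaches to phrases described above, the second (word positions stored in an inverted file) can be sketched as follows. This is a minimal illustration under simplifying assumptions: whitespace tokenization, exact adjacency, and an in-memory index. The function names are hypothetical and do not describe any participant's system.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return {doc_id: [start positions]} where the phrase's words occur adjacently."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Only documents containing every word of the phrase can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = {}
    for doc_id in candidates:
        # A start position p matches if word i of the phrase occurs at p + i.
        starts = [
            p for p in index[terms[0]][doc_id]
            if all(p + i in index[t][doc_id] for i, t in enumerate(terms[1:], 1))
        ]
        if starts:
            hits[doc_id] = starts
    return hits


# Hypothetical toy collection.
docs = {
    1: "information retrieval systems",
    2: "retrieval of information",
    3: "efficient information retrieval and information retrieval evaluation",
}
index = build_positional_index(docs)
print(phrase_match(index, "information retrieval"))
```

Storing positions enlarges the index but avoids both a separate phrase vocabulary and reparsing raw text at retrieval time, which is the space and time trade-off among the three approaches.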
2 Document Retrieval
The second discussion group was organized around
general issues in document retrieval. Participants were
encouraged to use their experience with the one and
two gigabyte TREC collections to forecast the issues
that will arise when collections are larger and more
distributed. Two issues dominated the discussion.