SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) The Efficiency Issues Workshop Report report of discussion group James P. Callan David D. Lewis National Institute of Standards and Technology D. K. Harman The Efficiency Issues Workshop Report James P. Callan Computer Science Department University of Massachusetts Box 34610 Amherst, MA 01003-4610, USA (callan[OCRerr]cs.umass.edu) Although most groups participating in TREC-2 em- phasised precision and recall, the conference was also an appropriate forum in which to discuss the efficiency of document indexing and retrieval. For some partici- pants, running out of disk space was their worst prob[OCRerr] lem, while for others running out of time was. Some groups were unable to run on the entire collection1 and several remarked that they were happy to have gotten anything running at all. No group reported finding TREC trivial. Two discussion groups were organized to address ef- ficiency issues, one focusing on document indexing, the other on document retrieval. Since efficiency has not been emphasised by the research IR community, par- ticipants in both groups felt that current algorithms have a lot of room for improvement. TREC provides one motivation for such improvements, as do ever- growing real world databases. However, there was con- cern in both groups that the TREC format, which en- courages participation in both ad hoc (retrospective) and routing (filtering) tasks, might discourage research on efficient task-specific architectures. The following sections provide more detail on each of the two discussion groups. 1 Document Indexing The raw text for the TREC collection (routing and ad hoc) required approximately 3 GB of space. Index structures required from 7.55Othemselves sufficient to recreate the original text, so they would be additional overhead in an operational system. (A research system might be able to discard the original text, reporting just document ids for evaluation.) Several groups, including CITRI, Thinking Ma- chines, and UMass, stored inverted lists in compressed form. There was general agreement that for sites will- ing to invest the programming effort, substantial space savings could be achieved in this fashion. (CITRI demonstrated a factor of six reduction in index file sise.) There was more debate on the potential for in- dex compression speeding up query processing as well, with some participants saying their query processing was I/O bound, but others saying theirs was CPU bound. The peak amount of space used during index- 303 David D. Lewis AT[OCRerr]T Bell Laboratories 600 Mountain Avenue Room 2C409 Murray Hill, NJ 07974, USA (1ewis[OCRerr]research.att .com) mg (for text, indices, and auxiliary files) varied from liOthat this may be worth more attention. Efficiency improvements will not come immediately, and some may require significant expense in program- ming time. Sharing of software between groups, which increased from TREC-1 to TREC-2, helps limit this expense. In addition, TREC research groups primar- ily interested in, say, query analysis, may in the future want to team up with groups that have addressed or are addressing issues of scale. As the sise of test col- lections increases, it makes less sense to have them replicated at dozens of sites, particular when interac- tive access across networks is usually easily available. Time to build index structures was tolerable for most participants, though it was mentioned that it never seems to go as easily or automatically as one might hope. It was a serious issue for groups doing some form of natural language processing. Times of 2 to 4 MB/hr were mentioned by at least two of the NL groups, making TREC a very daunting task. The opinion was expressed by some participants that an NL technique would have to provide as yet undemon- strated improvements in effectiveness to be worth the slowdown in indexing and query processing. Complex text representations, such as those pr[OCRerr] duced by NL, require additional information in the index structures. The different ways of dealing with this problem are most noticeable in the handling of phrases in the TREC, with some groups indexing on phrases just as on words, others relying on word p[OCRerr] sition information stored in an inverted file, and still others reparsing the raw text of a subset of documents at retrieval time to find phrase occurrences. 2 Document Retrieval The second discussion group was organized around general issues in document retrieval. Participants were encouraged to use their experience with the one and two gigabyte TREC collections to forecast the issues that will arise when collections are larger and more distributed. Two issues dominated the discussion.