NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

The Efficiency Issues Workshop Report
Report of discussion group
James P. Callan, David D. Lewis

National Institute of Standards and Technology
D. K. Harman

2.1 Will Existing Methods Scale?

Recent trends in information retrieval are towards gigabyte and terabyte document collections, retrieval by subsections/paragraphs, and multiple representations of document content. The first focus questions were whether existing storage and retrieval methods can cope with this explosion of information, and if new methods are needed, what might they be?

Participants identified two approaches as currently dominating IR:

1. Search all available subcollections (e.g. the TIPSTER/TREC task), or
2. have a user specify which subcollection to search (e.g. commercial systems).

Both approaches require modification if they are to scale up.

One problem with the "search everything" approach is that the growth rate of online information was felt to exceed the growth rate in computer performance. Even if a collection is distributed across multiple processors, detailed consideration of every document may be too expensive when an IR system faces a terabyte of data. One solution is to do a fast first-pass retrieval to produce a reduced set of documents for more detailed consideration. This first pass might involve generating approximate scores for each document (e.g. ETH in TREC-2), scoring documents based on a smaller amount of text (e.g. abstracts or introductions), or scoring cluster centroids rather than individual documents.

One problem with the "user chooses subcollections" approach is that the task will be overwhelming when there are many subcollections. A significant portion of users of one commercial service already choose to search everything rather than select subcollections.
The system will have to provide assistance if user selection is to be viable. If subcollections can be characterized succinctly, perhaps by centroid vectors, automatically generated thesauri, or controlled vocabulary terms (assigned manually or automatically), then one could use the query to rank subcollections, and then search only the top-ranked subcollections. Other approaches include assistance by an expert system, or browsing interfaces for hyperlinked subcollections.

Global statistics that summarize some aspect of a collection (e.g. idf) were expected to be a problem for searching multiple subcollections and distributed document collections. If a collection is formed at indexing time, statistics can be gathered and saved when indices are built. If a single processor performs retrieval, statistics can be gathered during retrieval. However, if multiple subcollections or processors are involved, it is less clear how to compute global statistics. Methods that rely only on local statistics that summarize a document (e.g. Berkeley in TREC-2) offer a computational advantage in this environment.

2.2 Specialized Hardware and Software

A related question posed by increasingly large and distributed text databases is whether existing hardware and software platforms will be up to the challenge. The consensus among participants was that conventional architectures will suffice, because IR is a data-parallel task that lends itself to distributed computation. Participants also felt that they had generally ignored issues of efficiency, and could increase their speeds if necessary. There was little support for supercomputers, massively parallel computers, or specialized architectures.
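To make the data-parallel argument concrete, the following Python sketch shows one way a collection split across several partitions might be searched, and how a global statistic such as idf (the concern raised in Section 2.1) could still be obtained by summing per-partition counts. Everything here is an assumption for illustration: the function names (`local_stats`, `global_idf`, `score_partition`, `search`), the simple tf x idf scoring, and the toy documents are hypothetical and are not taken from the workshop discussion or from any TREC-2 system.

```python
import heapq
import math
from collections import Counter


def local_stats(partition):
    """Return (document count, document frequency per term) for one partition."""
    df = Counter()
    for text in partition.values():
        df.update(set(text.lower().split()))
    return len(partition), df


def global_idf(all_stats):
    """Combine per-partition counts into collection-wide idf values."""
    total_docs = sum(n for n, _ in all_stats)
    total_df = Counter()
    for _, df in all_stats:
        total_df.update(df)
    return {term: math.log(total_docs / count) for term, count in total_df.items()}


def score_partition(partition, query_terms, idf, k):
    """Score every document in one partition; keep its k best (score, doc_id) pairs."""
    scored = []
    for doc_id, text in partition.items():
        tf = Counter(text.lower().split())
        score = sum(tf[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def search(partitions, query, k=3):
    """Round 1: gather statistics. Round 2: score partitions independently, then merge."""
    idf = global_idf([local_stats(p) for p in partitions])
    query_terms = query.lower().split()
    # Each call below touches only one partition, so it could run on its own processor.
    partial = [score_partition(p, query_terms, idf, k) for p in partitions]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))


# Toy data (hypothetical): two partitions standing in for two processors.
partitions = [
    {"a1": "parallel retrieval of text", "a2": "text compression methods"},
    {"b1": "parallel hardware for retrieval", "b2": "cooking with garlic"},
]
results = search(partitions, "parallel retrieval")
```

Document counts and document frequencies are additive across disjoint partitions, so under these assumptions the merged idf values equal those a single index over the whole collection would give. The price is an extra round of communication before scoring, which is the overhead that methods using only local statistics avoid.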
One could argue that the participants were biased towards conventional hardware and software by their own need for flexibility, their small budgets, their insulation from the time constraints of real users, and their desire not to think about "systems" issues not directly relevant to their research. However, the recent fielding of ranked retrieval systems using conventional mainframes by some of the largest online vendors provides additional support for their view that conventional architectures will suffice.