NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman

Inappropriate subcollection selection accounts for many of LSI's failures on computer-related topics. Consider, for example, topic 037 about IBM's SAA standards. LSI performs very poorly compared with other systems on this topic, returning the fewest relevant articles (19) and having the lowest 11-pt precision (.0088). Summing over all systems for this topic, 68% of the total number of returned articles (554/812) and 99% of the relevant articles (444/449) were from ZIFF2. For LSI, however, only 17 of the top 200 articles (8%) were from ZIFF2; all 17 of these articles were relevant. When a comparable proportion of LSI documents was selected from ZIFF2, the number relevant increased to 175 and the 11-pt precision increased to .3532. This performance places LSI slightly above the median for topic 037.

We performed the same analysis for all topics in which more than 33% of the total returned articles were from ZIFF. There were 26 such topics (19 adhoc topics from 026-050, and 7 routing topics). For these topics, the mean percentage of ZIFF articles chosen by all systems was 59%, compared with 9% for LSI. When comparable proportions of ZIFF articles were selected for LSI, the average number of relevant documents increased from 26 to 58, and the 11-pt precision increased from .0554 to .1111. A total of 700 new relevant documents were found for the 19 adhoc topics, and 125 new relevant documents were found for the 7 routing topics. Performance improvements were observed for 23 of the 26 topics; two of the topics that showed no improvement were about AT&T products, where poor pre-processing omitted AT&T from the LSI query (see below). While these results are encouraging, the problem of how to select appropriate subcollections is not solved.
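The re-selection described above amounts to enforcing per-subcollection quotas when merging ranked lists. A minimal sketch of that idea, assuming per-subcollection rankings with comparable scores (the function and variable names are illustrative, not from the paper):

```python
# Hypothetical sketch of quota-based merging across subcollections.
# Names (select_with_quotas, ranked_lists, quotas) are illustrative.

def select_with_quotas(ranked_lists, quotas, k):
    """Pick k documents overall, taking round(k * quota) from each
    subcollection's own ranking, then filling any remaining slots
    with the best unselected documents by score.

    ranked_lists: dict mapping subcollection name -> list of (doc_id, score),
                  each list sorted by descending score.
    quotas:       dict mapping subcollection name -> target proportion.
    """
    selected = []
    leftovers = []
    for name, docs in ranked_lists.items():
        take = round(k * quotas.get(name, 0))
        selected.extend(docs[:take])
        leftovers.extend(docs[take:])
    # Fill any remaining slots with the best unselected documents overall.
    leftovers.sort(key=lambda d: d[1], reverse=True)
    selected.extend(leftovers[: max(0, k - len(selected))])
    selected.sort(key=lambda d: d[1], reverse=True)
    return selected[:k]
```

With a quota of roughly 0.68 for ZIFF2 on topic 037, such a scheme would force the top 200 to contain a ZIFF2 share comparable to the other systems', rather than the 8% LSI's raw scores produced.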
For the routing topics, we could use training data to set some a priori mixture of articles from various subcollections. This strategy is not, however, generally applicable to adhoc queries. We will examine more appropriate ways of combining across subcollections to take distributional effects like this into account. Alternatively, we could use randomly selected documents (rather than topically organized ones) to create the subcollections. Finally, we could use a single large combined scaling, in which case there would be no need to combine across subcollections.

Finally, some misses were attributable to poor text (and query) pre-processing and tokenization. Since we do not keep single letters or use a database of company and location names, several important acronyms like U.S. and AT&T disappeared completely from both articles and queries! Not surprisingly, this resulted in many missed documents. We noticed that many of the top-performing automatic systems used SMART's pre-processing, and we hope to do so as well for TREC-2. This will allow us to better understand the usefulness of LSI per se, without many of the additional confounds introduced by different indexing.

3.3 Open experimental issues

The results of the failure analyses suggest several directions to pursue for TREC-2, including: improving pre-processing and tokenization; exploring some precision-enhancing methods; and developing methods for more effectively combining across subcollections. We also hope to explore three additional issues.
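To illustrate the tokenization failure described above, here is a minimal sketch (not the system's actual pre-processing) of a tokenizer that preserves dotted abbreviations like U.S. and ampersand-joined names like AT&T, instead of splitting on punctuation and discarding single letters:

```python
# Illustrative sketch only: a tokenizer that keeps "U.S." and "AT&T"
# as single tokens. This is not the pre-processing used by the system.
import re

# Order matters: match dotted abbreviations and &-joined names before
# falling back to ordinary words.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}          # dotted abbreviations: U.S., I.B.M.
  | [A-Za-z]+(?:&[A-Za-z]+)+    # ampersand-joined names: AT&T
  | [A-Za-z]+                   # ordinary words
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens in text, keeping acronyms intact."""
    return TOKEN_RE.findall(text)
```

A tokenizer that drops single letters and splits on "." and "&" would reduce both examples to nothing useful, which is exactly how AT&T vanished from the queries on the AT&T topics.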