We used the centroid of all relevant documents for some of the standard IR test collections (Med, CISI, Cranfield, CACM, Time). In these cases, we found an average improvement of 107% when the query was replaced by the centroid of all relevant documents. The improvement was 67% when the top three relevant documents were used, and 33% when just the first relevant document was used. The smaller advantages observed in TREC-2 are partially due to statistical artifacts, and partially to the TREC topics, which are much richer need statements than the usual IR queries. (We also examined topic and reldocs profiles in TREC-1. Somewhat surprisingly, the query using just the topic terms was about 25% more accurate than the query using relevant documents from training. This is attributable to the small number and inaccuracy of relevance judgements in the initial training set for TREC-1. This had a substantial impact on performance for some topics because our reldocs queries were based only on the relevant articles and ignored the original topic description.)

The lsir1 and lsir2 runs provide baselines against which various combinations of query information and relevant document information can be measured. We have tried a simple combination of the lsir1 and lsir2 profile vectors, in which both components have equal weight. That is, we took the sum of the lsir1 and lsir2 profile vectors for each of the topics and used this as a profile vector. The results of this analysis are shown in the third column of the table, labeled r1+r2. This combination does somewhat better than the centroid of the relevant documents in the total number of relevant documents returned and in average precision. (We returned fewer than 1000 documents for 5 of the topics, and not all documents returned by the r1+r2 method had been judged for relevance, so we suspect that performance could be improved a bit more.) For 27 of the topics, r1+r2 was better than the maximum of the other two methods. It was never more than about 10% worse than the best method. Thus it appears that this combination takes advantage of the best of both methods.

The r1+r2 method, which combines a query vector with a vector representing the centroid of all relevant documents, is a kind of relevance feedback. This is an unusual variant of relevance feedback, since all the words in relevant documents are used, words in non-relevant documents are not down-weighted, and query terms are not re-weighted. Interestingly, this method appears to produce improvements that are comparable to those obtained by Buckley, Allan and Salton (1993) using more traditional relevance feedback methods. Average precision for the r1+r2 method is 31% better than for lsir1, which used only the topic words (.3457 vs. .2622), and this is quite similar to the 38% improvement reported by Buckley, Allan and Salton (1993) for their richest routing query expansion method.

The lsir2 method is generally better than the lsir1 method, but there is substantial variability across topics. The topics on which there are the largest differences are generally those in which the cosine between the lsir1 and lsir2 topic vectors is smallest. The cosines between corresponding topic vectors range from .87 to .54.
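To make the vector arithmetic concrete, the following minimal sketch (in Python with numpy; the names topic_vector, relevant_doc_vectors and doc_vectors are hypothetical, and the details of our actual LSI implementation differ) shows one way to form the equal-weight r1+r2 profile, i.e. the sum of a topic-word vector and the centroid of the relevant-document vectors, and to rank documents by cosine similarity against it in the reduced LSI space.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two vectors in the reduced LSI space.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def r1_plus_r2_profile(topic_vector, relevant_doc_vectors):
        # r1: profile built from the topic words alone (as in lsir1).
        r1 = np.asarray(topic_vector, dtype=float)
        # r2: centroid of the vectors of all known relevant documents (as in lsir2).
        r2 = np.mean(np.asarray(relevant_doc_vectors, dtype=float), axis=0)
        # Equal-weight combination: the simple sum of the two profile vectors.
        return r1 + r2

    def rank_by_profile(profile, doc_vectors):
        # Score every document against the profile; return indices, best first.
        scores = np.array([cosine(profile, d) for d in doc_vectors])
        return np.argsort(-scores)

The same cosine function applied to r1 and r2 themselves gives the per-topic agreement measure discussed above (cosines ranging from .87 down to .54).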
The lsir2 method is substantially better on topics: 71 (incursions by foreign military or guerrilla groups), 73 (movement of people from one country to another), 87 (criminal actions against officers of failed financial institution), 94 (crime perpetrated with the aid of a computer), and 98 (production of fiber optics equipment). There are a few topics for which lsir1 is substantially better than lsir2: 63 (machine translation system), 65 (information retrieval system), 85 (actions against corrupt public officials), and 95 (computer application to crime solving). It is not entirely clear what distinguishes between these topics (compare, for example, topics 94 and 95).

We have not yet had time to look in detail at the failures of the LSI system. We will examine both misses and false alarms in more detail. A preliminary examination of a few topics suggests that lack of specificity is the main reason for false alarms (highly ranked but irrelevant documents). This is not surprising, because LSI was designed as a recall-enhancing method and we have not added precision-enhancing tools, although it would be easy to do so.

We would also like to examine some query splitting ideas. We have previously conducted experiments which suggest that performance can be improved if the filter is represented as several separate vectors. We did not use this method for the TREC-2 results we submitted, but would like to do so. (See also Kane-Esrig et al., 1991, or Foltz and Dumais, 1992, for a discussion of multi-point interest profiles in LSI.)

3.5 TREC-2: Adhoc experiments

We submitted two sets of adhoc queries, lsiasm and lsia1. We had intended to compare the new SMART pre-processing (lsiasm) and a single LSI space (lsia1) with our old TREC-1 pre-processing and 9 separate subcollection spaces. Unfortunately, there were some serious errors in our translation between internal document numbers and the <DOCNO> labels