NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
processing involved the use of logical connectives. LSI does not handle Boolean combinations
of words, and sometimes returned articles covering only a subset of ANDed topics.
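A minimal sketch of why a vector-sum query cannot enforce an AND (the two-dimensional topic vectors and cosine matching here are illustrative assumptions, not the actual LSI space):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two orthogonal topic directions standing in for two ANDed query terms.
term_a = np.array([1.0, 0.0])
term_b = np.array([0.0, 1.0])

query = term_a + term_b          # vector sum: "A AND B" becomes "A + B"

doc_only_a = 3.0 * term_a        # strongly about topic A, silent on B
doc_both   = term_a + term_b     # genuinely about both topics

print(cosine(query, doc_both))    # 1.0
print(cosine(query, doc_only_a))  # ~0.71 -- still a respectable score
```

Because similarity degrades smoothly rather than dropping to zero when one conjunct is absent, a document covering only one of the ANDed topics can still rank highly.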
Finally, it is not at all clear why about 20% of the false alarms were returned by LSI. Since LSI
uses a statistically-derived "semantic" space and not surface-level word overlap for matching
queries to documents, it is sometimes difficult to understand why a particular document was
returned. One advantage of the LSI method is that documents can match queries even when they
have no words in common; but this can also produce some spurious hits. Topic 066, about
natural language processing technology, returned several articles about chip processing
technologies and high technology products in general. Another reason for false alarms could be
inappropriate word sense disambiguation. LSI queries were located at the weighted vector sum
of the words, so words were "disambiguated" to some extent by the other query words.
Similarly, the initial SVD analysis used the context of other words in articles to determine the
location for each word in the LSI space. However, since each word has only one location, a word
with several distinct senses can end up "in the middle of nowhere", near none of them. A related
possibility concerns long
articles. Lengthy articles which talk about many distinct subtopics were averaged into a single
document vector, and this can sometimes produce spurious matches. Breaking larger documents
into smaller subsections and matching on these might help.
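The dilution effect of averaging can be sketched with a toy example (the term vectors and the two-dimensional "semantic" space below are made up for illustration; they do not come from the actual SVD analysis):

```python
import numpy as np

# Toy term vectors in a 2-D "semantic" space (illustrative only).
term_vecs = {
    "language": np.array([1.0, 0.1]),
    "parsing":  np.array([0.9, 0.2]),
    "chip":     np.array([0.1, 1.0]),
    "wafer":    np.array([0.2, 0.9]),
}

def vector_for(words):
    """Locate a query or document at the (normalized) sum of its term vectors."""
    v = sum(term_vecs[w] for w in words)
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vector_for(["language", "parsing"])

# A focused NLP article matches the query well...
nlp_doc = vector_for(["language", "parsing", "language"])
# ...but a long article mixing NLP with chip manufacturing is averaged
# toward the middle of the space, diluting both of its themes.
mixed_doc = vector_for(["language", "chip", "wafer", "chip"])

print(cosine(query, nlp_doc) > cosine(query, mixed_doc))  # True
```

The multi-topic document scores noticeably lower against the query than the focused one, even though it contains relevant material, which is the motivation for matching on smaller subsections.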
3.2.2.2 Misses. For this analysis we examined a random subset of relevant articles that were
not in the top 200 returned by LSI. Many of the relevant articles were fairly highly ranked by
LSI, but there were also some notable failures that would be seen only by the most persistent
readers. So far, we have not systematically distinguished between misses that "almost made it"
and those that were much further down the list.
About 40% of the misses we examined represent articles that were primarily about a different
topic than the query, but contained a small section that was relevant to the query. Because
documents are located at the average of their terms in LSI space, they will generally be near the
dominant theme, and this is a desirable feature of the LSI representation. Some kind of local
matching should help in identifying less central themes in documents.
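One form such local matching could take is a best-passage score (a sketch under the assumption that subsection vectors are formed by the same kind of term averaging as whole documents; the vectors below are toy values):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passage_score(query_vec, passage_vecs):
    """Score a document by its best-matching subsection instead of its
    whole-document average, so a relevant minor section is not washed out."""
    return max(cosine(query_vec, p) for p in passage_vecs)

# Toy example: the document's dominant theme (passage 1) is off-topic,
# but passage 2 is close to the query.
query = np.array([1.0, 0.0])
passages = [np.array([0.1, 1.0]), np.array([0.9, 0.1])]

whole_doc = sum(passages)              # averaged representation
print(cosine(query, whole_doc))        # diluted by the dominant theme
print(passage_score(query, passages))  # recovers the relevant section
```

The best-passage score rewards the relevant subsection directly, whereas the averaged vector sits near the dominant (off-topic) theme.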
Another 40% of the misses appear to be the result of inappropriate selection of subcollections.
Recall that we analyzed 9 subcollections separately and combined the similarities later to arrive
at a single ranked list. The different subcollections sometimes had different densities of
documents on some topics. This is most evident when considering computer-related topics. For
general collections like AP or WSJ, relatively few of the articles were about computers, and we
suspect that few of the 235-250 dimensions in the LSI semantic space were devoted to
distinguishing among such documents. Thus, the similarities of these documents to queries
about computers were relatively high and undifferentiated. For the ZIFF collections, on the
other hand, most of the LSI dimensions were used to represent differences among computer
concepts. Similarities of the top few hundred articles to computer queries were lower on
average, but much finer distinctions among subtopics were possible. One consequence of this
was that, when combining across collections, few articles from the ZIFF subcollections were
included for queries about computer-related topics! Different term weights in the different
subcollections also contributed to this problem.
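The merging problem can be illustrated with a small sketch (the collection names and scores below are invented; per-collection z-score normalization is shown only as one possible remedy, not as the method actually used):

```python
import statistics

# Hypothetical raw similarities for a computer-related query. The general
# collection gives uniformly high, undifferentiated scores; the specialist
# collection makes finer distinctions at a lower absolute level.
general = {"ap_1": 0.82, "ap_2": 0.81, "ap_3": 0.80}
specialist = {"ziff_1": 0.61, "ziff_2": 0.45, "ziff_3": 0.30}

# Naive merge by raw similarity: specialist articles sink to the bottom,
# even though ziff_1 is the best-differentiated match in its collection.
merged = sorted({**general, **specialist}.items(),
                key=lambda kv: kv[1], reverse=True)
print([doc for doc, _ in merged[:3]])  # all from the general collection

def zscores(scores):
    """Standardize scores within one collection so scales are comparable."""
    mu = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values())
    return {d: (s - mu) / sd for d, s in scores.items()}

normed = {**zscores(general), **zscores(specialist)}
merged2 = sorted(normed.items(), key=lambda kv: kv[1], reverse=True)
print([doc for doc, _ in merged2[:3]])  # the specialist's best article surfaces
```

Under the naive merge, every top-ranked document comes from the general collection; after within-collection standardization, the specialist collection's strongest article rises to the top of the combined list.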