NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
processing involved the use of logical connectives. LSI does not handle Boolean combinations
of words, and sometimes returned articles covering only a subset of ANDed topics.
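A minimal sketch of why a vector-sum query cannot enforce an AND (the two-dimensional topic vectors and cosine matching here are illustrative assumptions, not the actual LSI space):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two orthogonal topic directions standing in for two ANDed query terms.
term_a = np.array([1.0, 0.0])
term_b = np.array([0.0, 1.0])

query = term_a + term_b          # vector sum: "A AND B" becomes "A + B"

doc_only_a = 3.0 * term_a        # strongly about topic A, silent on B
doc_both   = term_a + term_b     # genuinely about both topics

print(cosine(query, doc_both))    # 1.0
print(cosine(query, doc_only_a))  # ~0.71 -- still a respectable score
```

Because similarity degrades smoothly rather than dropping to zero when one conjunct is absent, a document covering only one of the ANDed topics can still rank highly.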
Finally, it is not at all clear why about 20% of the false alarms were returned by LSI. Since LSI
uses a statistically-derived "semantic" space and not surface-level word overlap for matching
queries to documents, it is sometimes difficult to understand why a particular document was
returned. One advantage of the LSI method is that documents can match queries even when they
have no words in common; but this can also produce some spurious hits. Topic 066, about
natural language processing technology, returned several articles about chip processing
technologies and high technology products in general. Another reason for false alarms could be
inappropriate word sense disambiguation. LSI queries were located at the weighted vector sum
of the words, so words were "disambiguated" to some extent by the other query words.
Similarly, the initial SVD analysis used the context of other words in articles to determine the
location for each word in the LSI space. However, since each word has only one location, a word
with several distinct senses can end up "in the middle of nowhere", near none of them. A related
possibility concerns long
articles. Lengthy articles which talk about many distinct subtopics were averaged into a single
document vector, and this can sometimes produce spurious matches. Breaking larger documents
into smaller subsections and matching on these might help.
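The dilution effect of averaging can be sketched with a toy example (the term vectors and the two-dimensional "semantic" space below are made up for illustration; they do not come from the actual SVD analysis):

```python
import numpy as np

# Toy term vectors in a 2-D "semantic" space (illustrative only).
term_vecs = {
    "language": np.array([1.0, 0.1]),
    "parsing":  np.array([0.9, 0.2]),
    "chip":     np.array([0.1, 1.0]),
    "wafer":    np.array([0.2, 0.9]),
}

def vector_for(words):
    """Locate a query or document at the (normalized) sum of its term vectors."""
    v = sum(term_vecs[w] for w in words)
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vector_for(["language", "parsing"])

# A focused NLP article matches the query well...
nlp_doc = vector_for(["language", "parsing", "language"])
# ...but a long article mixing NLP with chip manufacturing is averaged
# toward the middle of the space, diluting both of its themes.
mixed_doc = vector_for(["language", "chip", "wafer", "chip"])

print(cosine(query, nlp_doc) > cosine(query, mixed_doc))  # True
```

The multi-topic document scores noticeably lower against the query than the focused one, even though it contains relevant material, which is the motivation for matching on smaller subsections.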
3.2.2.2 Misses. For this analysis we examined a random subset of relevant articles that were
not in the top 200 returned by LSI. Many of the relevant articles were fairly highly ranked by
LSI, but there were also some notable failures that would be seen only by the most persistent
readers. So far, we have not systematically distinguished between misses that "almost made it"
and those that were much further down the list.
About 40% of the misses we examined represent articles that were primarily about a different
topic than the query, but contained a small section that was relevant to the query. Because
documents are located at the average of their terms in LSI space, they will generally be near the
dominant theme, and this is a desirable feature of the LSI representation. Some kind of local
matching should help in identifying less central themes in documents.
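One form such local matching could take is a best-passage score (a sketch under the assumption that subsection vectors are formed by the same kind of term averaging as whole documents; the vectors below are toy values):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passage_score(query_vec, passage_vecs):
    """Score a document by its best-matching subsection instead of its
    whole-document average, so a relevant minor section is not washed out."""
    return max(cosine(query_vec, p) for p in passage_vecs)

# Toy example: the document's dominant theme (passage 1) is off-topic,
# but passage 2 is close to the query.
query = np.array([1.0, 0.0])
passages = [np.array([0.1, 1.0]), np.array([0.9, 0.1])]

whole_doc = sum(passages)              # averaged representation
print(cosine(query, whole_doc))        # diluted by the dominant theme
print(passage_score(query, passages))  # recovers the relevant section
```

The best-passage score rewards the relevant subsection directly, whereas the averaged vector sits near the dominant (off-topic) theme.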
Another 40% of the misses appear to be the result of inappropriate selection of subcollections.
Recall that we analyzed 9 subcollections separately and combined the similarities later to arrive
at a single ranked list. The different subcollections sometimes had different densities of
documents on some topics. This is most evident when considering computer-related topics. For
general collections like AP or WSJ, relatively few of the articles were about computers, and we
suspect that few of the 235-250 dimensions in the LSI semantic space were devoted to
distinguishing among such documents. Thus, the similarities of these documents to queries
about computers were relatively high and undifferentiated. For the ZIFF collections, on the
other hand, most of the LSI dimensions were used to represent differences among computer
concepts. Similarities of the top few hundred articles to computer queries were lower on
average, but much finer distinctions among subtopics were possible. One consequence of this
was that, when combining across collections, few articles from the ZIFF subcollections were
included for queries about computer-related topics! Different term weights in the different
subcollections also contributed to this problem.
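The merging problem can be illustrated with a small sketch (the collection names and scores below are invented; per-collection z-score normalization is shown only as one possible remedy, not as the method actually used):

```python
import statistics

# Hypothetical raw similarities for a computer-related query. The general
# collection gives uniformly high, undifferentiated scores; the specialist
# collection makes finer distinctions at a lower absolute level.
general = {"ap_1": 0.82, "ap_2": 0.81, "ap_3": 0.80}
specialist = {"ziff_1": 0.61, "ziff_2": 0.45, "ziff_3": 0.30}

# Naive merge by raw similarity: specialist articles sink to the bottom,
# even though ziff_1 is the best-differentiated match in its collection.
merged = sorted({**general, **specialist}.items(),
                key=lambda kv: kv[1], reverse=True)
print([doc for doc, _ in merged[:3]])  # all from the general collection

def zscores(scores):
    """Standardize scores within one collection so scales are comparable."""
    mu = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values())
    return {d: (s - mu) / sd for d, s in scores.items()}

normed = {**zscores(general), **zscores(specialist)}
merged2 = sorted(normed.items(), key=lambda kv: kv[1], reverse=True)
print([doc for doc, _ in merged2[:3]])  # the specialist's best article surfaces
```

Under the naive merge, every top-ranked document comes from the general collection; after within-collection standardization, the specialist collection's strongest article rises to the top of the combined list.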