NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
chapter
S. Dumais
National Institute of Standards and Technology
D. K. Harman
documents, it is sometimes difficult to understand why
a particular document was returned. One advantage of
the LSI method is that documents can match queries
even when they have no words in common; but this
can also produce some spurious hits. Another reason
for false alarms could be inappropriate word sense
disambiguation. LSI queries are located at the
weighted vector sum of the words, so words are
"disambiguated" to some extent by the other query
words. Similarly, the initial SVD analysis used the
context of other words in articles to determine the
location for each word in the LSI space. However,
since each word has only one location, it sometimes
appears as if it is "in the middle of nowhere". A
related possibility concerns long articles. Lengthy
articles which talk about many distinct subtopics were
averaged into a single document vector, and this can
sometimes produce spurious matches. Breaking larger
documents into smaller subsections and matching on
these might help.
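As a rough illustration of the subsection idea, the sketch below splits a long token sequence into overlapping passages and scores a document by its best-matching passage rather than its average. The window and overlap sizes are arbitrary assumptions for illustration, not parameters from our experiments.

```python
import numpy as np

def split_into_passages(tokens, window=200, overlap=50):
    """Break a long token list into overlapping fixed-size passages."""
    step = window - overlap
    passages = []
    for start in range(0, max(1, len(tokens) - overlap), step):
        passages.append(tokens[start:start + window])
    return passages

def best_passage_score(passage_vectors, query_vector):
    """Score a document by its best-matching passage (cosine similarity),
    so a relevant subtopic is not washed out by the document average."""
    q = query_vector / np.linalg.norm(query_vector)
    best = -1.0
    for p in passage_vectors:
        best = max(best, float(np.dot(p / np.linalg.norm(p), q)))
    return best
```

Scoring by the maximum over passages lets a lengthy article match a query even when the relevant material is confined to one subsection.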
4.2.2 Misses.
For this analysis we will examine a random subset of
relevant articles that were not in the top 1000 returned
by LSI. Many of the relevant articles were fairly
highly ranked by LSI, but there were also some
notable failures that would be seen only by the most
persistent readers. So far, we have not systematically
distinguished between misses that "almost made it"
and those that were much further down the list.
Most of the misses we examined represent articles
that were primarily about a different topic than the
query, but contained a small section that was relevant
to the query. Because documents are located at the
average of their terms in LSI space, they will
generally be near the dominant theme, and this is a
desirable feature of the LSI representation. Some kind
of local matching should help in identifying less
central themes in documents.
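The averaging behavior described above can be sketched in a toy two-dimensional space. The term vectors and weights here are invented for illustration (real LSI spaces have hundreds of dimensions); the point is only that a document dominated by one theme lands near that theme.

```python
import numpy as np

# Hypothetical 2-D "LSI" space; each term is a single point.
term_vecs = {
    "stock":  np.array([0.9, 0.1]),
    "market": np.array([0.8, 0.2]),
    "recipe": np.array([0.1, 0.9]),
}

def doc_vector(terms, weights=None):
    """Locate a document at the (weighted) average of its term vectors."""
    vs = np.array([term_vecs[t] for t in terms])
    w = np.ones(len(terms)) if weights is None else np.asarray(weights, float)
    return (w[:, None] * vs).sum(axis=0) / w.sum()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A document mostly about finance with one cooking term sits near the
# dominant (finance) theme, so a "recipe" query matches it only weakly.
d = doc_vector(["stock", "market", "market", "recipe"])
```

This is exactly why local (passage-level) matching is attractive: the minor theme is recoverable only if it is not averaged into the dominant one.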
Some misses were also attributable to poor text (and
query) pre-processing and tokenization.
4.3 Open issues
On the basis of preliminary failure analyses, we would
like to explore some precision-enhancing methods.
We would also like to explore three additional areas.
4.3.1 Separate vs. combined scaling
We used 9 separate subscalings for the TREC-1
experiments. For TREC-2 we used a single scaling
(based on a very small sample). We have also
recently finished a complete scaling and will compare
this with the subcollection scalings and the sampled
full scaling.
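The core of any such scaling is a truncated SVD of a term-document matrix. The sketch below shows the mechanics on a toy matrix; the matrix entries, weighting, and retained dimensionality are placeholders (our TREC runs used log-entropy term weighting and a few hundred dimensions on far larger matrices).

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Real scalings apply a term weighting (e.g. log-entropy) first.
A = np.array([
    [1., 1., 0., 0.],
    [1., 0., 0., 0.],
    [0., 1., 1., 0.],
    [0., 0., 1., 1.],
    [0., 0., 0., 1.],
])

k = 2  # retained dimensions; a placeholder for the few hundred used in TREC
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_space = U[:, :k] * s[:k]   # term coordinates in the reduced space
doc_space = Vt[:k, :].T         # document coordinates in the reduced space
```

Whether the SVD is computed on the full collection, on subcollections, or on a sample, the result is the same kind of reduced space; the open question is how the choice affects retrieval quality.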
4.3.2 Centroid query vs. many separate
points of interest
A single vector was used to represent each query. In
some cases the vector was the average of terms in the
topic statement, and in other cases the vector was the
average of previously identified relevant documents.
A single query vector can be inappropriate if interests
are multifaceted and these facets are not near each
other in the LSI space. We have developed techniques
that allow us to match using a controllable
compromise between averaged and separate vectors
(Kane-Esrig et al., 1991). In the case of the routing
queries, for example, we could match new documents
against each of the previously identified relevant
documents separately rather than against their
average.
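The contrast between centroid and separate matching can be sketched as follows. The `alpha` mixing parameter is a hypothetical device for this illustration, not the exact controllable compromise of the cited 1991 technique.

```python
import numpy as np

def score(doc, relevant, alpha=0.0):
    """Blend centroid matching (alpha=1) with best-separate matching
    (alpha=0). `alpha` is an assumed illustrative parameter."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    centroid = np.mean(relevant, axis=0)       # one averaged query vector
    separate = max(cos(doc, r) for r in relevant)  # nearest single relevant doc
    return alpha * cos(doc, centroid) + (1 - alpha) * separate
```

When the known relevant documents cover distinct facets, a new document close to any one facet scores well under separate matching but is penalized by the centroid, which sits between the facets.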
4.3.3 Interactive interfaces
All LSI evaluations were conducted using a non-
interactive system in essentially batch mode. It is well
known that one can have the same underlying retrieval
and matching engine, but achieve very different
retrieval success using different interfaces. We would
like to examine the performance of real users with
interactive interfaces. A number of interface features
could be used to help users make faster (and perhaps
more accurate) relevance judgements, or to help them
explicitly reformulate queries. (See Dumais and
Schmitt, 1991, for some preliminary results on query
reformulation and relevance feedback.) Another
interesting possibility involves returning something
richer than a rank-ordered list of documents to users.
For example, a clustering and graphical display of the
top-k documents might be quite useful. We have done
some preliminary experiments using clustered return
sets, and would like to extend this work to the TREC
collections.
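A minimal version of such a clustered return set can be sketched with a simple k-means grouping of the top-ranked document vectors. The vectors and the choice of k-means here are illustrative assumptions, not the method used in our preliminary experiments.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means for grouping top-ranked document vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each document to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Six "retrieved" documents with two obvious topic groups.
docs = np.array([[1., 0.], [0.9, 0.1], [1.1, 0.],
                 [0., 1.], [0.1, 0.9], [0., 1.1]])
labels = kmeans(docs, 2)
```

Presenting the return set as labeled groups rather than a flat ranked list would let a user dismiss an entire irrelevant cluster at once.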
The general idea is to provide people with useful
interactive tools that let them make good use of their
knowledge and skills, rather than attempting to build
all the smarts into the database representation or
matching components of the system.