    Name    Num   Description
    -----   ---   -----------
    Pairs   I     Query sentences vs. doc sentences
            II    Query paragraphs vs. doc paragraphs
            III   Entire query vs. doc paragraphs
    Which   a     Top matching pair
            b     Non-zero matching pairs
            c     Pairs where similarity exceeds threshold
            d     All pairs
    Value   1     Similarity (avg)
            2     Number of common terms (avg)
            3     Top matching term (avg)
            4     Count of pairs

Table 3: Local values considered for LSP weighting
(all combinations, choosing one from each category)
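As a concrete illustration of one such local value, the sketch below (our own, not the SMART implementation) computes the combination I-b-1: the average similarity over the non-zero matching query-sentence / document-sentence pairs. The cosine function and the representation of sentences as term-to-weight dictionaries are assumptions made only for this example.

from itertools import product

def cosine(u, v):
    # cosine similarity of two sparse term->weight dictionaries
    common = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in common)
    norm = (sum(w * w for w in u.values()) ** 0.5) * \
           (sum(w * w for w in v.values()) ** 0.5)
    return dot / norm if norm else 0.0

def local_value_I_b_1(query_sentences, doc_sentences):
    # all query-sentence / document-sentence pairs (pair type I)
    sims = [cosine(q, d) for q, d in product(query_sentences, doc_sentences)]
    # keep only the non-zero matching pairs (choice b), average their
    # similarities (value 1)
    nonzero = [s for s in sims if s > 0.0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0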
Future work
We are currently investigating the use of re-
gression analysis to find correlation between
relevance and local similarity values. Using
such analysis will allow the local values to be
selected for cause rather than solely because of
experience and intuition. If successful, it will
also provide a collection-independent method
of selecting which local values are useful. Note
that this approach does require a training set
of queries and relevance judgements.
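We have not yet fixed a particular regression model; the following sketch shows one plausible form of the analysis, fitting a logistic regression of relevance judgements on the local values of Table 3 and ranking local values by the magnitude of their fitted coefficients. The library choice and the toy training data are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per (query, document) pair in the training set, one column per
#    local value from Table 3 (e.g. I-b-1, II-a-2, ...).
# y: 1 if the document was judged relevant to the query, else 0.
# The tiny arrays below are purely illustrative placeholders.
X = np.array([[0.8, 3.0], [0.1, 1.0], [0.6, 2.0], [0.0, 0.0]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Local values whose coefficients are large in absolute value correlate most
# strongly with relevance and are candidates for use in LSP weighting.
for feature, coef in enumerate(model.coef_[0]):
    print(feature, coef)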
We are interested in applying these tech-
niques to the TREC collections with a more
useful definition of "paragraph." [17, 2] sug-
gest the possibility of narrowing the search
window to fixed-size pieces, ignoring para-
graph boundaries. Hearst's "TextTiling" ap-
proach ([7]) is intriguing for the topic-coherent
units of text it produces.
Routing
In this work, routing queries are formed in
two distinct phases. In the first phase, con-
cepts which occur often in relevant documents
are added to the original query to expand the
vocabulary used. In the second phase, the
original concepts plus the added concepts are
weighted based upon their occurrences in rel-
evant and non-relevant documents.
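A minimal sketch of the first phase, under the simplifying assumption that "concepts" are single terms: rank terms by how often they occur in the known relevant documents and add the most frequent ones to the query vocabulary. The cutoff value and data structures are illustrative only, not the exact SMART procedure.

from collections import Counter

def expand_query(query_terms, relevant_docs, n_added=100):
    # count how often each term occurs across the relevant documents
    freq = Counter(t for doc in relevant_docs for t in doc)
    # most frequently occurring terms not already in the query
    candidates = [t for t, _ in freq.most_common()
                  if t not in set(query_terms)]
    return set(query_terms) | set(candidates[:n_added])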
In TREC 1, query expansion was a major
obstacle. It was clear that only very limited
expansion was useful, and indeed the best au-
tomatic routing run ([5]) used no expansion at
all. Thus the original plans for TREC 2 rout-
ing included extensive investigation into very
selectively adding concepts to queries.
However, as work on TREC 2 progressed
it became obvious that the TREC 1 results
were somewhat anomalous. For the routing
approaches used in this work, selectivity of
added terms is not an issue. Rather, the more
terms that are added, the better the result,
up to a point of diminishing returns. This re-
sult agrees with our experiences on the (small)
feedback test collections that we have worked
with in the past. The original TREC 1 train-
ing data for routing was extremely sketchy
and the resulting unusual query expansion re-
sults were probably due to the lack of infor-
mation about what a representative relevant
document looked like.
The basic routing approach chosen is the
feedback approach of Rocchio ([9,13]). Ex-
pressed in vector space terms, the final query
vector is the initial query vector moved to-
ward the centroid of the relevant documents,
and away from the centroid of the non-relevant
documents.
Q_new = A * Q_old
        + B * (average weight in relevant documents)
        - C * (average weight in non-relevant documents)
Terms that end up with negative weights are
dropped (less than 3% of terms were dropped
in the most massive query expansion below).
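A minimal sketch of this reweighting step, assuming query and document vectors are represented as term-to-weight dictionaries (our illustration, not the SMART code):

def rocchio(q_old, rel_docs, nonrel_docs, A, B, C):
    # collect every term seen in the old query or any feedback document
    terms = set(q_old)
    for d in rel_docs + nonrel_docs:
        terms |= set(d)
    q_new = {}
    for t in terms:
        rel_avg = (sum(d.get(t, 0.0) for d in rel_docs) / len(rel_docs)
                   if rel_docs else 0.0)
        nonrel_avg = (sum(d.get(t, 0.0) for d in nonrel_docs) / len(nonrel_docs)
                      if nonrel_docs else 0.0)
        w = A * q_old.get(t, 0.0) + B * rel_avg - C * nonrel_avg
        if w > 0.0:          # terms that end up with negative weight are dropped
            q_new[t] = w
    return q_new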
The parameters of Rocchio's method are the
relative importance of the original query, the