Name    Num   Description
Pairs   I     Query sentences vs. doc sentences
        II    Query paragraphs vs. doc paragraphs
        III   Entire query vs. doc paragraphs
Which   a     Top matching pair
        b     Non-zero matching pairs
        c     Pairs where similarity exceeds threshold
        d     All pairs
Value   1     Similarity (avg)
        2     Number of common terms (avg)
        3     Top matching term (avg)
        4     Count of pairs

Table 3: Local values considered for LSP weighting (all combinations, choosing one from each category)

Future work

We are currently investigating the use of regression analysis to find correlations between relevance and local similarity values. Such analysis would allow the local values to be selected on principled grounds rather than solely from experience and intuition. If successful, it would also provide a collection-independent method of selecting which local values are useful. Note that this approach does require a training set of queries and relevance judgements.

We are interested in applying these techniques to the TREC collections with a more useful definition of "paragraph." The work in [17, 2] suggests the possibility of narrowing the search window to fixed-size pieces, ignoring paragraph boundaries. Hearst's "TextTiling" approach ([7]) is intriguing for the topic-coherent units of text it produces.

Routing

In this work, routing queries are formed in two distinct phases. In the first phase, concepts which occur often in relevant documents are added to the original query to expand the vocabulary used. In the second phase, the original concepts plus the added concepts are weighted based upon their occurrences in relevant and non-relevant documents.

In TREC 1, query expansion was a major obstacle. It was clear that only very limited expansion was useful, and indeed the best automatic routing run ([5]) used no expansion at all. Thus the original plans for TREC 2 routing included extensive investigation into very selectively adding concepts to queries.

However, as work on TREC 2 progressed it became obvious that the TREC 1 results were somewhat anomalous. For the routing approaches used in this work, selectivity of added terms is not an issue. Rather, the more terms that are added, the better the result, up to a point of diminishing returns. This result agrees with our experience on the (small) feedback test collections that we have worked with in the past. The original TREC 1 training data for routing was extremely sketchy, and the resulting unusual query expansion results were probably due to the lack of information about what a representative relevant document looked like.

The basic routing approach chosen is the feedback approach of Rocchio ([9, 13]). Expressed in vector space terms, the final query vector is the initial query vector moved toward the centroid of the relevant documents, and away from the centroid of the non-relevant documents:

    Q_new = A * Q_old + B * (average weight in relevant docs)
                      - C * (average weight in non-relevant docs)

Terms that end up with negative weights are dropped (less than 3% of terms were dropped in the most massive query expansion below). The parameters of Rocchio's method are the relative importance of the original query, the
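To make the two-phase query construction concrete, the sketch below implements a Rocchio-style update over simple term-weight dictionaries: phase one adds the concepts that occur in the most relevant documents, and phase two reweights every surviving term with the formula above, dropping any term whose weight ends up negative. This is an illustration only, not the SMART implementation; the function names, the dictionary representation of vectors, the number of terms added, and the parameter values A, B, and C are assumptions chosen for the example.

# Hedged sketch (not the SMART code) of two-phase Rocchio routing query
# construction.  Query and document vectors are plain {term: weight} dicts;
# A, B, C and num_terms are illustrative values, not the ones used in the paper.

from collections import defaultdict


def centroid(doc_vectors):
    """Average term weight over a list of document vectors."""
    sums = defaultdict(float)
    for vec in doc_vectors:
        for term, wt in vec.items():
            sums[term] += wt
    n = max(len(doc_vectors), 1)
    return {term: total / n for term, total in sums.items()}


def expand_query(query, rel_vectors, num_terms=100):
    """Phase 1: add concepts that occur in many relevant documents."""
    doc_freq = defaultdict(int)
    for vec in rel_vectors:
        for term in vec:
            doc_freq[term] += 1
    candidates = sorted((t for t in doc_freq if t not in query),
                        key=doc_freq.get, reverse=True)
    expanded = dict(query)
    for term in candidates[:num_terms]:
        expanded[term] = 0.0          # placeholder; weighted in phase 2
    return expanded


def rocchio_weight(expanded_query, rel_vectors, nonrel_vectors,
                   A=8.0, B=16.0, C=4.0):
    """Phase 2: Q_new = A*Q_old + B*avg(rel) - C*avg(nonrel); negatives dropped."""
    rel_centroid = centroid(rel_vectors)
    nonrel_centroid = centroid(nonrel_vectors)
    new_query = {}
    for term, old_wt in expanded_query.items():
        wt = (A * old_wt
              + B * rel_centroid.get(term, 0.0)
              - C * nonrel_centroid.get(term, 0.0))
        if wt > 0:                    # terms ending up with negative weights are dropped
            new_query[term] = wt
    return new_query

In this sketch the relative sizes of A, B, and C control how far the final query vector moves toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, which is the trade-off the Rocchio parameters express.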