  X    number of single terms to add (possible values 0 to 500)
  Y    number of phrases to add (0 to 100)
  A    relative importance of original query (fixed at 8)
  B    relative importance of average weight in relevant documents (4 to 48)
  C    relative importance of average weight in non-relevant documents (0 to 16)
  P    relative importance of phrases in final retrieval as compared to single terms (0, 0.5, or 1.0)

                     Table 4: Parameters of routing

Just re-weighting the query terms according to Rocchio's algorithm gives a 7% improvement. Adding a few terms (20 single terms + 10 phrases) gives a 17% improvement over the base case, and expanding by 350 (300+50) terms results in a 38% improvement.

The official run crnlC1 is actually a bit disappointing. It results in only a 3% improvement over the crnlR1 run, which is not very significant considering the effort required. Few people are going to keep track of 158 test runs on a per-query basis. It may be practical to keep track of 4 or so main query variants, but then the improvement would probably be less than 3%. We are currently conducting experiments in this area.

An open question is the effectiveness of varying the feedback approach itself between queries. Preliminary experiments using Fuhr's RPI ([3]) weighting schemes in addition to the Rocchio variants show larger improvements. In general, RPI (and the other probabilistic models) perform noticeably better than Rocchio if there is very little query expansion, though quite a bit worse under massive expansion. We expect that the combination of RPI for those queries with little expansion and Rocchio for the other queries will work well.

One benefit of the crnlC1 run not entirely represented by the evaluation figures is that retrieval performance is more even. Potential mismatches between feedback method and query are far less likely. crnlC1 does reasonably well on all the queries (above the median system for every query when compared against the other systems).

Routing Implementation and Timing

The original routing queries are automatically indexed from the query text and weighted using the "ltc" weighting scheme (equation (1)). Collection frequency information used for the idf factors is gathered from the D12 documents only. Relevance information about potential query terms is gathered and stored on a per-query basis. For each query, statistics (including relevant and non-relevant frequency and total "ltc" weights) are kept for the 1000 most frequently occurring terms in the D12 relevant documents. For TREC 2, this is done by a batch run taking about 90 CPU minutes. In practice, this would be done incrementally as each document was compared to the query and judged. The statistics amounted to about 40,000 bytes per query.

Using these statistics, and the decided-upon parameters for the feedback process (A, B, etc.), actual construction of the final query takes about 0.5 seconds per query.

Retrieval times vary tremendously with the length of the query. We ran in batch mode, constructing an inverted file for the entire D3 document set ("lnc" document weights) and then comparing each query against that inverted file.
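For concreteness, the "ltc" query weighting, the "lnc" document weighting, and the inner-product comparison against the inverted file can be sketched as follows. This is only a minimal illustration under the usual SMART reading of those weighting triples (logarithmic tf, idf or no idf, cosine normalization); equation (1) is not reproduced here, and the function and variable names are ours rather than those of the SMART code.

    # Sketch only: illustrative names, not the SMART implementation.
    import math
    from collections import defaultdict

    def ltc_weights(term_freqs, df, N):
        """Query weights: (1 + ln tf) * ln(N/df), cosine-normalized."""
        w = {t: (1.0 + math.log(tf)) * math.log(N / df[t])
             for t, tf in term_freqs.items() if df.get(t, 0) > 0}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    def lnc_weights(term_freqs):
        """Document weights: (1 + ln tf), no idf, cosine-normalized."""
        w = {t: 1.0 + math.log(tf) for t, tf in term_freqs.items()}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    def score(query, inverted_file):
        """inverted_file: term -> list of (doc_id, lnc weight) postings.
        Accumulates inner-product (cosine) scores per document."""
        scores = defaultdict(float)
        for term, qw in query.items():
            for doc_id, dw in inverted_file.get(term, ()):
                scores[doc_id] += qw * dw
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)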
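Similarly, the feedback process governed by the parameters of Table 4 can be illustrated schematically. The sketch below combines Rocchio re-weighting (parameters A, B, C) with expansion by the X highest-weighted new single terms and the Y highest-weighted new phrases, the phrases entering the final query with relative importance P. All data structures, helper names, and default parameter values (chosen from within the ranges of Table 4) are hypothetical simplifications, not the actual routing implementation.

    # Sketch only: a simplified Rocchio-style routing query construction.
    def is_phrase(term):
        # Hypothetical convention: phrases are stored as multi-word strings.
        return " " in term

    def build_routing_query(orig_query, rel_avg, nonrel_avg,
                            A=8, B=16, C=4, X=300, Y=50, P=0.5):
        # orig_query: term -> original "ltc" weight of the query term.
        # rel_avg / nonrel_avg: term -> average "ltc" weight of the term in
        # the judged relevant / non-relevant training documents.
        candidates = set(orig_query) | set(rel_avg)
        rocchio = {}
        for term in candidates:
            w = (A * orig_query.get(term, 0.0)
                 + B * rel_avg.get(term, 0.0)
                 - C * nonrel_avg.get(term, 0.0))
            if w > 0.0:
                rocchio[term] = w

        # Keep all original query terms; add the X best new single terms and
        # the Y best new phrases, ranked by their Rocchio weight.
        new_terms = sorted((t for t in rocchio if t not in orig_query),
                           key=rocchio.get, reverse=True)
        singles = [t for t in new_terms if not is_phrase(t)][:X]
        phrases = [t for t in new_terms if is_phrase(t)][:Y]

        query = {t: rocchio[t] for t in orig_query if t in rocchio}
        query.update({t: rocchio[t] for t in singles})
        # Phrases contribute with relative importance P at retrieval time.
        query.update({t: P * rocchio[t] for t in phrases})
        return query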
Not only is this batch approach not what would be done in practice, it is also much less efficient than a practical implementation would be, given our massive expansion of queries: for each query in crnlR1, well over half of the entire inverted file was read! CPU time per query ranged from about 5 seconds (no expansion) to 65 seconds (expansion by 500 terms).

Conclusion

No firm conclusions can be reached regarding the usefulness of combining local and global similarities in the TREC environment. In some limited circumstances minor improvements can be obtained, but in general we have not (yet!) been able to take advantage of the local information we know should be useful.