NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), D. K. Harman (ed.), National Institute of Standards and Technology

An Information Retrieval Test-bed on the CM-5
B. Masand and C. Stanfill

the cosine measure's document normalization factor, we find that it is possible to implement document-term weighting based only on information in the structured posting file. In general, term weights (both in queries and documents) may be computed based on the following factors [5], all of which are available at run-time in our architecture.

* Term Frequency. This is the number of times a term occurs in the database. This information is retained in the lexicon.

* Document-Term Frequency. This is the number of documents in which a term occurs. It may be computed at run-time by traversing the posting list for the term, counting initial occurrences of the term.

* In-Document Frequency. This is the number of times a term occurs in a given document. This may be computed at run-time by traversing the posting list for the term, counting the number of occurrences of the term having the same document coordinate.

* Maximum In-Document Frequency. This is the maximum number of times a given term occurs in any document. It may be computed directly from the in-document frequencies.

Most conventional weighting schemes (e.g. tf.idf) may be computed from these run-time factors. The advantage of doing these computations at run-time is that it eliminates the need to incorporate collection-specific information into the database at indexing time. This is important for dynamic collections as well as for distributed databases, as described above.

The difficulty with these methods when applied to the cosine norm is that the total-document-weight term cannot be conveniently computed at run-time using an inverted file. It must, therefore, be computed at index time and stored separately (e.g. in the document-specific information data structure). There are two solutions to this difficulty. The first is to insist on using the cosine norm, accepting the difficulties that this implies. The second is to look for alternatives to the cosine norm. We believe that this is the more promising approach, but this is in the realm of work not yet completed. In any event, the value of the cosine norm for large structured documents has not been established at this time.

IV. RETRIEVAL EXPERIMENTS

Most of the time (about 3 to 4 person-months) for our TREC participation was spent on building the test bed. However, we did finish some runs with automatically constructed routing and adhoc queries, for which we report the results here. The experiments were done using the entire collection. We compare words and phrases, mixed case vs. lower case, and also explore document length normalization and proximity measures.

A. Retrieval for Routing Topics

All queries were constructed and optimized automatically. The terms consisted of words (ignoring stop words) as well as all phrases consisting of adjacent words. Numbers were ignored and case was preserved. No stemming or query expansion with thesauri etc. was attempted.

Routing queries were constructed by looking at each word and (adjacent) phrase from the whole text of the topic templates and determining a weight based on the number of relevant documents present in the first 100 documents retrieved using just that term. An initial weighted query was constructed for each topic by this process. Then each topic query was optimized by choosing thresholds (per topic) for the weights and rejecting all weighted query terms below the threshold. The optimum threshold for each topic was chosen by straightforward incremental search.
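To make the routing procedure concrete, the following sketch shows one way the per-term weighting and incremental threshold search could be implemented. It is illustrative only: retrieve() and average_precision() are hypothetical stand-ins for the test-bed's retrieval engine and evaluation code, and their names and signatures are our own assumptions rather than part of the system described.

def term_weight(term, relevant_docs, retrieve, k=100):
    # Weight a candidate term by the number of relevant documents
    # among the first k documents retrieved using that term alone.
    ranked = retrieve({term: 1.0}, limit=k)
    return sum(1 for doc in ranked if doc in relevant_docs)

def initial_routing_query(candidate_terms, relevant_docs, retrieve):
    # Assign an initial weight to every candidate word and
    # adjacent-word phrase drawn from the topic template.
    return {t: term_weight(t, relevant_docs, retrieve)
            for t in candidate_terms}

def optimize_threshold(query, relevant_docs, retrieve, average_precision):
    # Incremental search over weight thresholds: prune terms whose
    # weight falls below the threshold, and keep the pruned query
    # that scores best against the training judgments.
    best_score, best_query = -1.0, query
    for threshold in sorted(set(query.values())):
        pruned = {t: w for t, w in query.items() if w >= threshold}
        if not pruned:
            break
        score = average_precision(retrieve(pruned, limit=1000), relevant_docs)
        if score > best_score:
            best_score, best_query = score, pruned
    return best_query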
Table 1 shows the results for the routing experiments.

Table 1: Routing Queries

    Method                                           Precision at 100 docs   Average precision
    tmc6-routing-words-phrases                       .3396                   .2553
    tmc7-routing-words                               .2920                   .2045
    tmc6-routing-words-phrases-ip                    .3750                   .2716
    tmc6-routing-words-phrases-doc-length-sent-prox  .3782                   .2792
    tmc6-routing-words-phrases-ip-sent-prox          .3856                   .3344

Routing queries with both weighted words and phrases (queryid tmc6) did better than queries using just words (queryid tmc7). Using the same (official) queries, but adding sentence-level proximity (sent-prox), document length scaling (doc-length), and inverse weights based on which paragraph the term appears in (ip), seems to improve results (see the next section for more details about these techniques).

B. Retrieval for Adhoc Topics

Adhoc queries were automatically constructed by using words and phrases from different sections of the topic templates and using tf.idf weights (as derived from the training collection). The "best" sections for the new topics were chosen by experimenting with the training topics. Queries derived from the description and concepts sections were used for most of the experiments. A threshold for weights was used to select terms for the final queries (sketched below). Table 2 shows the results for the adhoc queries.
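As an illustration of the adhoc construction, the sketch below weights candidate terms from the chosen topic sections and applies the weight threshold. The exact tf.idf variant used in the official runs is not specified above, so the form shown (raw term frequency times log inverse document frequency from the training collection) is an assumption, as are the data structures passed in.

import math
from collections import Counter

def tf_idf(tf, df, num_docs):
    # One common tf.idf form: raw frequency in the topic text times
    # the log inverse document frequency from the training collection.
    return tf * math.log(num_docs / df)

def build_adhoc_query(section_terms, doc_freqs, num_docs, threshold):
    # section_terms: words/phrases extracted from the chosen topic
    # sections (e.g. description and concepts); doc_freqs maps each
    # term to its document frequency in the training collection.
    counts = Counter(section_terms)
    query = {}
    for term, tf in counts.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # skip terms unseen in the training collection
        weight = tf_idf(tf, df, num_docs)
        if weight >= threshold:  # weight threshold selects final terms
            query[term] = weight
    return query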