NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Multilevel Ranking in Large Text Collections Using FAIRS
chapter
S-C. Chang
H. Dediu
H. Azzam
M-W. Du
National Institute of Standards and Technology
Donna K. Harman
number 1, and assume the following settings for the vari-
able attributes:
FREQUENCY = neutral (s[1] = 0)
POPULARITY = neutral (s[2] = 0)
WORD_LOC = neutral (s[3] = 0)
REC_ID = neutral (s[4] = 0)
REC_SIZE = positive (s[5] = 1)
The contributing factors to the weight of a retrieved docu-
ment are IMPORTANCE and REC_SIZE. Because of its
large size (a[5] >> 0), record 1 will be irrelevantly
retrieved in response to a large percentage of pending que-
ries.
To avoid the above scenario, Method 2 uses the following
formula to compute the normalized weight of a retrieved
record r at level l:

    W_l[r] = SUM(i=1..K) e[i] * (1/C) * SUM(j=1..A) s[j] * a[j]/am[j]
Where:
e[i] = 1 if keyword i exists in the record and 0 other-
wise,
a[j] = the value of attribute j,
A = the number of attributes (currently 6),
K = the total number of query keywords,
s[j] = 1 if the value of attribute j is positive,
0 if the value of attribute j is neutral,
-1 if the value of attribute j is negative,
C = the number of attributes whose value is not 0,
am[j] = the maximum value of attribute j.
To avoid the effect of disproportionately large attribute
values, a[j]/am[j] is set to 1 if a[j] is one of the largest 2%
of attribute values. Method 2 has the advantage of accommo-
dating large attribute values by normalizing with respect to
their maximum.
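As a sketch in Python (the function and variable names are ours, the example attribute values are invented, and we read C as the count of non-neutral attributes, which the terse definition above leaves ambiguous), the Method 2 weight might be computed as:

```python
def method2_weight(e, a, s, am, cap_set=frozenset()):
    """Illustrative sketch of Method 2's normalized weight.
    e[i]:  1 if query keyword i occurs in the record, else 0.
    a[j]:  value of attribute j; am[j]: maximum value of attribute j.
    s[j]:  +1 / 0 / -1 for positive / neutral / negative attribute settings.
    cap_set: indices j whose a[j] falls in the largest 2% of attribute
             values; their ratio a[j]/am[j] is clamped to 1."""
    C = sum(1 for sj in s if sj != 0)   # non-neutral attributes (our reading)
    if C == 0:
        return 0.0                      # all-neutral case: weight collapses to 0
    inner = 0.0
    for j in range(len(a)):
        ratio = 1.0 if j in cap_set else a[j] / am[j]
        inner += s[j] * ratio
    # The inner sum is independent of i, so the outer sum over keywords
    # reduces to multiplying by the number of matching keywords.
    return sum(e) * inner / C

# Invented example: only REC_SIZE positive (s[5] = 1), six attributes,
# two of three query keywords matching.
e = [1, 1, 0]
s = [0, 0, 0, 0, 1, 0]
a = [3, 5, 1, 7, 800, 2]
am = [10, 20, 4, 9, 1000, 6]
print(method2_weight(e, a, s, am))      # 2 * (800/1000) / 1 = 1.6
```

Note how the all-neutral case returns 0 regardless of keyword matches; this is exactly the limitation discussed next.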
However, this method does not automatically account for
coverage, since s[j] contributes as a multiplicand rather
than as an exponent as in Method 1. Therefore, if the
attributes are all neutral (level 1 in Table 1, the default
attributes), the weight would be 0 even though query
terms occur in the record r. We are therefore still
looking for ways to improve the method.
3.0 Experiments
The TREC92 collection of text and topics was used to
quantify and analyze the performance of FAIRS in the
domain of very large collections. The entire collection
includes 2.3Gb of text from various sources and of various
formats.
3.1 System Configuration
The hardware platform used for indexing and query pro-
cessing was an IBM RS/6000 320 workstation with AIX
3.0 operating system. The core memory (RAM) installed
was 32 Mb. The net disk space available was
4,169,728Kb. The CPU clock rate was 25 MHz, with a
MIPS rating of 27. The disk access time was 9.8 ms on
average.
3.2 Indexing Performance
Our experiments showed that indexing performance is
strictly constrained by I/O wait. We used several tech-
niques to reduce this constraint and optimize throughput.
We scheduled index runs to run simultaneously, thus keep-
ing the CPU busy during I/O wait. We used a kernel which
employs automatic disk caching (providing an order-of-
magnitude improvement in indexing time). To further
exploit caching, large amounts of high-speed core memory
should be used (our limit of 32 Mb is by no means ideal).
The I/O bottleneck also implies that the very fastest disks
should be used. We chose 9.8 ms disks as the fastest available
for the platform at a reasonable price. Using the above con-
figuration, the sustained indexing throughput was on the
order of 10 days (240 hrs.) per Gb. The storage overhead for
index output was 110%. That is, for 1 Gb of text, 2.1 Gb of
space should be available before indexing is initiated. By
this measure, the entire TREC92 collection of 2.3 Gb
would require 4.83 Gb of storage. Since only 4.17 Gb was
available, we participated in category B (Wall Street Jour-
nal only).
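The storage arithmetic above can be checked in a couple of lines (a sketch; the 110% overhead figure is the one reported in the text):

```python
def required_space_gb(text_gb, overhead=1.10):
    # Raw text plus its index (110% of the text size) must both
    # fit on disk before indexing starts.
    return text_gb * (1 + overhead)

print(round(required_space_gb(1.0), 2))   # 2.1 Gb for 1 Gb of text
print(round(required_space_gb(2.3), 2))   # 4.83 Gb for the full TREC92 collection
```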
3.3 Query Processing
Before the raw topics were submitted to FAIRS, they were con-
verted automatically into FAIRS-compatible queries of the
form described in section 2.1.4. The TREC92 topics were
pre-processed by a simple syntactic term-token generator.
The pre-processor removed stop words, stemmed topic terms,
and removed case. It used an ad hoc term-
weighting scheme to assign IMPORTANCE weights to
terms according to their positions in the topic (e.g. terms
occurring in the title section were assigned an arbitrarily
higher importance than those in other sections).
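A pre-processor of this kind might be sketched as follows (a toy version: the stop-word list, the suffix-stripping stemmer, the section weights, and the sample topic text are all our own illustrative assumptions, not the original scheme):

```python
# Toy topic pre-processor: stop-word removal, crude stemming, case folding,
# and position-based IMPORTANCE weighting (all values are assumptions).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to"}

def crude_stem(term):
    # A toy suffix stripper; the paper does not specify its stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess_topic(sections):
    """sections: mapping of section name -> raw topic text.
    Returns (term, importance) pairs; title terms get a higher,
    arbitrarily chosen IMPORTANCE weight than terms elsewhere."""
    weights = {"title": 3, "description": 2}   # assumed values
    query = []
    for name, text in sections.items():
        w = weights.get(name, 1)
        for token in text.lower().split():
            token = token.strip(".,;:!?")
            if token and token not in STOP_WORDS:
                query.append((crude_stem(token), w))
    return query

topic = {"title": "Pending Antitrust Cases",
         "description": "Documents discussing pending antitrust cases in the courts"}
print(preprocess_topic(topic))
```

The output pairs map directly onto the query form of section 2.1.4, with the second element serving as the term's IMPORTANCE weight.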