SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel

National Institute of Standards and Technology
Donna K. Harman

There are two observations that one might make after examining both forms of evaluation. The first is that there appears to be a small gain, of less than 10%, obtained by treating the text in the query in a structured way using the methodology proposed here. The second result is surprising and has significant ramifications: the use of adjacent pairs in previous experiments on small collections with relatively short queries provided substantial improvement in recall/precision [3, 12], yet we see no such improvement in these experiments.

Finally, we may compare the three Boolean algorithms in their ability to identify appropriate subsets for ranking. In this comparison, we use V4 for ranking each of the subsets. Let A_c represent the algorithm based on concepts, A_d the algorithm based on the description, and A_dp the algorithm that augments the description with adjacent pairs. Table 10 shows recall/precision figures and Table 11 shows precision for a fixed number of documents returned.

Recall   10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
A_c    0.185  0.089  0.054  0.031  0.028  0.012  0.000  0.000  0.000  0.000  0.040
A_d    0.220  0.133  0.048  0.017  0.018  0.017  0.000  0.000  0.000  0.000  0.045
A_dp   0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 10: Comparison of Boolean algorithms

Documents     5     15     30    100    200
A_c       0.328  0.319  0.306  0.269  0.169
A_d       0.349  0.332  0.311  0.222  0.149
A_dp      0.349  0.305  0.304  0.230  0.171

Table 11: Comparison of Boolean algorithms for a fixed number
of documents returned.

Under both methods of evaluation, the algorithm that uses the key descriptors without adjacent pairs (A_d) outperformed both the pair-augmented algorithm (A_dp) and the algorithm using concepts (A_c). This is despite the fact that A_c returned an average of 1190 documents per query and A_dp an average of 667 documents, while A_d returned an average of only 530 documents.

3.3 The Atlas System

The Boolean query generation experiments were performed using the Atlas database system [5]. This system is a nested relational database system designed for text applications. The system was not adapted or tuned in any way to support the TREC experiments.

There are a number of signature file organisations supported by Atlas: bit-sliced, two-level, and multi-organisation schemes. We implemented an algorithm that analyses the data and selects the most appropriate scheme, together with optimal parameters for that scheme [11]. For the TREC collection, the multi-organisation scheme was selected. It took 23 hours to load the database and build the index, which required 313 Mb of disk space. This indexed all stopped and stemmed terms, including adjacent pairs. The text stored in the database was compressed using the mechanism described in Section 2.3.
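The figures in Tables 10 and 11 follow the standard retrieval measures: interpolated precision at fixed recall levels, and precision after a fixed number of documents has been returned. A minimal sketch of both measures (the function names and toy data are illustrative, not taken from the paper):

```python
def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 15, 30, 100, 200)):
    """Precision after a fixed number of documents has been returned
    (the measure reported in Table 11)."""
    rel = set(relevant)
    return {k: sum(1 for d in ranking[:k] if d in rel) / k for k in cutoffs}

def interpolated_precision(ranking, relevant,
                           levels=tuple(i / 10 for i in range(1, 11))):
    """Interpolated precision at standard recall levels (Table 10):
    at each level, take the maximum precision achieved at that
    recall or any higher recall."""
    rel = set(relevant)
    hits = 0
    points = []  # (recall, precision) after each relevant document found
    for i, doc in enumerate(ranking, 1):
        if doc in rel:
            hits += 1
            points.append((hits / len(rel), hits / i))
    return {lv: max((p for r, p in points if r >= lv), default=0.0)
            for lv in levels}
```

For example, for a ranking ["a", "b", "c", "d"] with relevant set {"a", "c"}, precision at cutoff 2 is 0.5 and interpolated precision at recall 0.5 is 1.0, since the first relevant document is retrieved at rank one.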
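The signature file organisations supported by Atlas all build on superimposed coding. A minimal sketch of the underlying idea, assuming an illustrative 64-bit signature width and three bits per term (in Atlas these parameters are chosen per collection by the selection algorithm described above):

```python
import hashlib

WIDTH = 64   # signature width in bits (illustrative choice)
K = 3        # bit positions set per term (illustrative choice)

def term_bits(term):
    # Derive K pseudo-random bit positions from the term.
    h = hashlib.md5(term.encode()).digest()
    return {h[i] % WIDTH for i in range(K)}

def signature(terms):
    # A record's signature is the OR of its terms' codes.
    sig = 0
    for t in terms:
        for b in term_bits(t):
            sig |= 1 << b
    return sig

def candidates(query_terms, signatures):
    """Records whose signatures contain every query bit. Superimposed
    coding admits false matches, so these are candidates only and must
    still be checked against the actual text."""
    qsig = signature(query_terms)
    return [i for i, s in enumerate(signatures) if s & qsig == qsig]
```

Every record that truly contains the query terms is guaranteed to appear among the candidates; the design trade-off, which the scheme-selection algorithm balances, is between signature size and the rate of false matches that must be filtered out.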