NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
National Institute of Standards and Technology
Donna K. Harman

3.4 Combined Database Results

Due to our inability to hold the whole of TREC in both the form used for the compression experiments and the form used for the Atlas system, we ran a further set of experiments using the compressed database system, but using the Boolean algorithm based on concepts to test some of the rank formulas. We performed 4 experiments where we repeated experiments using Vd, the description vector; Vn, the narrative vector; Va, the vector of all text with structure ignored; V1, where all text was used but structure was taken into consideration; and V2, a modification of V1 where pairs were added as an extra vector. The results are given in Table 12 as precision at fixed levels of recall, and in Table 13 as precision at various cutoffs based on the number of documents retrieved.

Recall   10%    20%    30%    40%    50%    60%    70%    80%    90%    100%   Av.
Vd       0.324  0.196  0.136  0.090  0.083  0.034  0.022  0.005  0.000  0.000  0.089
Vn       0.325  0.218  0.128  0.084  0.080  0.034  0.022  0.005  0.000  0.000  0.090
Va       0.298  0.170  0.116  0.074  0.067  0.040  0.026  0.005  0.000  0.000  0.080
V1       0.372  0.235  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099
V2       0.372  0.236  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099

Table 12: Ranking all data

Documents  5      15     30     100    200    Av.
Vd         0.417  0.431  0.404  0.349  0.294  0.379
Vn         0.370  0.408  0.407  0.353  0.301  0.368
Va         0.383  0.391  0.379  0.329  0.279  0.352
V1         0.447  0.472  0.438  0.377  0.312  0.409
V2         0.447  0.475  0.438  0.375  0.312  0.409

Table 13: Ranking all data

The results we obtained for the smaller collection are again observed with the full 2 Gb of text. However, the advantage of using the structure of the queries is slightly more pronounced.

4 Summary

In the first line of investigation, we built a compressed retrieval system for around 2 Gb of text that required under 37% of the size of the input text, and was created in just over 15 CPU-hours on a typical well-configured workstation. Retrieval with simple techniques such as the cosine measure is fast, and queries can be processed effectively on low-end workstations in seconds. The techniques developed during this first thread of investigation have thus altered the status of 2 Gb information retrieval systems from being 'unpleasantly expensive' to being 'eminently practical'.

In our second thread of research we have seen that using the structure of queries has enabled Boolean queries to be generated that have few disjuncts and perform better than more complicated ones. Given the large number of documents that are relevant to many of the queries that have been examined, such algorithms must be subject to reduced performance