NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, edited by Donna K. Harman

Chapter: Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
The times in Table 2 include the cost of writing the retrieved documents to disk for later
analysis by a post-processing program, but do not include the time required by that post-processing.
On average the queries required about 500 disk accesses and the transfer of 2.8 Mb of data.
At (say) 100 accesses per second on the disk², and a transfer rate of 1 Mb/sec, the input/output
operations contribute about 5-8 seconds to the elapsed query time, accounting for the observed
difference between CPU times and elapsed times.
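The arithmetic behind this estimate is easy to check. The following sketch simply plugs in the figures quoted above; the variable names are ours, not the paper's:

```python
# Back-of-envelope I/O cost for an average TREC query,
# using the figures quoted in the text.
accesses = 500        # average disk accesses per query
transfer_mb = 2.8     # average data transferred, in Mb
seek_rate = 100       # accesses per second (optimistic; see footnote 2)
transfer_rate = 1.0   # Mb per second

seek_time = accesses / seek_rate              # seconds spent seeking
transfer_time = transfer_mb / transfer_rate   # seconds spent transferring

print(seek_time, transfer_time)   # 5.0 2.8
print(seek_time + transfer_time)  # roughly 7.8 seconds in total
```

The seek component alone accounts for about 5 seconds, and the transfer adds close to 3 more, matching the 5-8 second gap between CPU and elapsed times.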
For small queries, response was even more impressive. For a query involving four terms
('Australia wheat tariff subsidy') and retrieval of the top ten documents, 2.6 seconds of CPU
time were required; and, if output was directed into a file, the total elapsed time for the query
was under 4 seconds.
The use of approximate lengths in-memory (using b = 8) meant that on average only
219 exact weights were required to identify the top 200 documents for the 50 TREC queries.
Depending on whether the exact lengths would have been stored in memory or on disk, with
the use of approximate lengths we have either reduced the memory required by over 2 Mb,
at the expense of 19 disk operations; or, alternatively, at a cost of 725 Kb of memory, we have
avoided the need to sequentially read the entire exact lengths file, a saving of several seconds
per query.
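One plausible way to realise b-bit approximate lengths is to bucket each exact length on a logarithmic scale and store only the bucket number, decoding a conservative upper bound when ranking. The sketch below illustrates this idea; the function names and the particular coding are ours, and the paper's exact scheme may differ:

```python
import math

def approx_code(length, min_len, max_len, b=8):
    """Quantize a document length into a b-bit bucket on a
    logarithmic scale (an illustrative scheme, not necessarily
    the one used in the paper)."""
    buckets = 1 << b
    lo, hi = math.log(min_len), math.log(max_len)
    code = int((math.log(length) - lo) / (hi - lo) * (buckets - 1))
    return min(code, buckets - 1)

def approx_length(code, min_len, max_len, b=8):
    """Conservative upper bound on any length stored in this bucket."""
    buckets = 1 << b
    lo, hi = math.log(min_len), math.log(max_len)
    return math.exp(lo + (code + 1) / (buckets - 1) * (hi - lo))
```

With b = 8 each document's length fits in one byte, so a table of 725 Kb corresponds to roughly 742,000 documents, against nearly 3 Mb for exact 32-bit lengths, which is consistent with the savings quoted above.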
We also tested using the limited precision lengths to provide the final ranking, ignoring
the exact lengths altogether. This means that the cosine measure is approximate, and some
variation in recall effectiveness can be expected. In our previous experiments on smaller col-
lections this degradation was minimal for even very low values of b. We used the 47 partial
judgments supplied, and compared the number of relevant answers using the exact ranking and
the approximate ranking technique. These results are shown in Table 3. Note that relevance
judgments were not provided for all documents, hence the range for precision: most queries
returned some documents that were unjudged. If we pessimistically assume that these docu-
ments were not relevant, the precision should be taken to be the lower value. Clearly, although
the approximate ranking is finding different documents, the use of approximate ranking has,
down to b = 4, little effect on precision.
  b       Number of answers
          5           15          30          100         200
  exact   55.7-56.2   52.1-52.3   48.8-49.6   40.7-44.1   34.2-43.3
  12      55.3-55.7   51.9-52.2   48.8-49.6   40.7-44.1   34.1-43.3
  10      55.3-55.7   51.8-52.1   48.9-49.7   40.6-44.1   34.1-43.2
  8       54.9-55.7   51.2-51.5   48.8-49.5   40.6-44.1   34.2-43.3
  6       55.7-56.6   50.8-51.3   47.5-48.4   40.6-44.5   34.1-44.0
  4       57.9-59.1   50.2-50.9   46.5-48.6   38.0-45.6   31.2-48.0
  2       42.6-46.4   32.6-42.4   29.9-46.6   21.3-57.7   15.7-65.8
Table 3: Effect of varying b (percentage precision, 47 queries)
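The precision ranges in Table 3 arise from treating unjudged documents first as irrelevant (lower bound) and then as relevant (upper bound). A sketch of that bound computation, where the helper name and the example document identifiers are ours:

```python
def precision_bounds(retrieved, relevant, judged):
    """Lower and upper bounds on precision when some retrieved
    documents have no relevance judgment (illustrative helper)."""
    rel = sum(1 for d in retrieved if d in relevant)
    unjudged = sum(1 for d in retrieved if d not in judged)
    n = len(retrieved)
    # Pessimistic: unjudged are irrelevant; optimistic: all relevant.
    return rel / n, (rel + unjudged) / n

# Example: 10 documents retrieved, 4 judged relevant, 2 never judged.
lo, hi = precision_bounds(
    retrieved=list(range(10)),
    relevant={0, 2, 5, 7},
    judged=set(range(8)),   # documents 8 and 9 were not judged
)
print(lo, hi)  # 0.4 0.6
```

Each cell of Table 3 reports exactly such a pair, expressed as a percentage over the 47 queries with judgments.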
² One hundred accesses per second for the file containing the compressed text is optimistic for purely random
accesses, since a transfer of several Kb from a random point in a large file in the Unix file system will typically
require three or more seeks, each costing perhaps 10 to 20 milliseconds. However, the accesses to the document
information file are in sequential order, and the accesses into the compressed text were batched into groups that
were performed in sequential order.