NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, edited by Donna K. Harman

Chapter: Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
The times in Table 2 include the cost of writing the retrieved documents to disk for later
analysis by a post-processing program, but do not include the time required by that post-processing.
On average the queries required about 500 disk accesses and the transfer of 2.8 Mb of data.
At (say) 100 accesses per second on the disk², and a transfer rate of 1 Mb/sec, the input/output
operations contribute about 5-8 seconds to the elapsed query time, accounting for the observed
difference between CPU times and elapsed times.
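The arithmetic behind this estimate is easy to check. The following sketch simply plugs in the figures quoted above; the variable names are ours, not the paper's:

```python
# Back-of-envelope I/O cost for an average TREC query,
# using the figures quoted in the text.
accesses = 500        # average disk accesses per query
transfer_mb = 2.8     # average data transferred, in Mb
seek_rate = 100       # accesses per second (optimistic; see footnote 2)
transfer_rate = 1.0   # Mb per second

seek_time = accesses / seek_rate              # seconds spent seeking
transfer_time = transfer_mb / transfer_rate   # seconds spent transferring

print(seek_time, transfer_time)   # 5.0 2.8
print(seek_time + transfer_time)  # roughly 7.8 seconds in total
```

The seek component alone accounts for about 5 seconds, and the transfer adds close to 3 more, matching the 5-8 second gap between CPU and elapsed times.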
For small queries, response was even more impressive. For a query involving four terms
('Australia wheat tariff subsidy') and retrieval of the top ten documents, 2.6 seconds of CPU
time were required; and, if output was directed into a file, the total elapsed time for the query
was under 4 seconds.
The use of approximate lengths in-memory (using b = 8) meant that on average only
219 exact weights were required to identify the top 200 documents for the 50 TREC queries.
Depending on whether the exact lengths would have been stored in memory or on disk, with
the use of approximate lengths we have either reduced the memory required by over 2 Mb,
at the expense of 19 disk operations; or, alternatively, at a cost of 725 Kb of memory, we have
avoided the need to sequentially read the entire exact lengths file, a saving of several seconds
per query.
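One plausible way to realise b-bit approximate lengths is to bucket each exact length on a logarithmic scale and store only the bucket number, decoding a conservative upper bound when ranking. The sketch below illustrates this idea; the function names and the particular coding are ours, and the paper's exact scheme may differ:

```python
import math

def approx_code(length, min_len, max_len, b=8):
    """Quantize a document length into a b-bit bucket on a
    logarithmic scale (an illustrative scheme, not necessarily
    the one used in the paper)."""
    buckets = 1 << b
    lo, hi = math.log(min_len), math.log(max_len)
    code = int((math.log(length) - lo) / (hi - lo) * (buckets - 1))
    return min(code, buckets - 1)

def approx_length(code, min_len, max_len, b=8):
    """Conservative upper bound on any length stored in this bucket."""
    buckets = 1 << b
    lo, hi = math.log(min_len), math.log(max_len)
    return math.exp(lo + (code + 1) / (buckets - 1) * (hi - lo))
```

With b = 8 each document's length fits in one byte, so a table of 725 Kb corresponds to roughly 742,000 documents, against nearly 3 Mb for exact 32-bit lengths, which is consistent with the savings quoted above.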
We also tested using the limited precision lengths to provide the final ranking, ignoring
the exact lengths altogether. This means that the cosine measure is approximate, and some
variation in recall effectiveness can be expected. In our previous experiments on smaller col-
lections this degradation was minimal for even very low values of b. We used the 47 partial
judgments supplied, and compared the number of relevant answers using the exact ranking and
the approximate ranking technique. These results are shown in Table 3. Note that relevance
judgments were not provided for all documents, hence the range for precision: most queries
returned some documents that were unjudged. If we pessimistically assume that these docu-
ments were not relevant, the precision should be taken to be the lower value. Clearly, although
the approximate ranking is finding different documents, the use of approximate ranking has,
down to b = 4, little effect on precision.
  b       Number of answers
          5           15          30          100         200
  exact   55.7-56.2   52.1-52.3   48.8-49.6   40.7-44.1   34.2-43.3
  12      55.3-55.7   51.9-52.2   48.8-49.6   40.7-44.1   34.1-43.3
  10      55.3-55.7   51.8-52.1   48.9-49.7   40.6-44.1   34.1-43.2
  8       54.9-55.7   51.2-51.5   48.8-49.5   40.6-44.1   34.2-43.3
  6       55.7-56.6   50.8-51.3   47.5-48.4   40.6-44.5   34.1-44.0
  4       57.9-59.1   50.2-50.9   46.5-48.6   38.0-45.6   31.2-48.0
  2       42.6-46.4   32.6-42.4   29.9-46.6   21.3-57.7   15.7-65.8
Table 3: Effect of varying b (percentage precision, 47 queries)
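The precision ranges in Table 3 arise from treating unjudged documents first as irrelevant (lower bound) and then as relevant (upper bound). A sketch of that bound computation, where the helper name and the example document identifiers are ours:

```python
def precision_bounds(retrieved, relevant, judged):
    """Lower and upper bounds on precision when some retrieved
    documents have no relevance judgment (illustrative helper)."""
    rel = sum(1 for d in retrieved if d in relevant)
    unjudged = sum(1 for d in retrieved if d not in judged)
    n = len(retrieved)
    # Pessimistic: unjudged are irrelevant; optimistic: all relevant.
    return rel / n, (rel + unjudged) / n

# Example: 10 documents retrieved, 4 judged relevant, 2 never judged.
lo, hi = precision_bounds(
    retrieved=list(range(10)),
    relevant={0, 2, 5, 7},
    judged=set(range(8)),   # documents 8 and 9 were not judged
)
print(lo, hi)  # 0.4 0.6
```

Each cell of Table 3 reports exactly such a pair, expressed as a percentage over the 47 queries with judgments.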
² One hundred accesses per second for the file containing the compressed text is optimistic for purely random
accesses, since a transfer of several Kb from a random point in a large file in the Unix file system will typically
require three or more seeks, each costing perhaps 10 to 20 milliseconds. However, the accesses to the document
information file are in sequential order, and the accesses into the compressed text were batched into groups that
were performed in sequential order.