SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text
chapter
A. Kent
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
Donna K. Harman
3.4 Combined Database Results
Due to our inability to hold the whole of TREC in the form used for the compression
experiments, and the form used for the Atlas system, we ran a further set of experiments using
the compressed database system, but using the Boolean algorithm based on concepts to test some
of the rank formulas. We performed 4 experiments in which we repeated the earlier runs using Vd,
the description vector, Vn, the narrative vector, Va, the vector of all text with structure ignored,
V1, where all text was used but structure was taken into consideration, and V2, a modification
of V1 where pairs were added as an extra vector. The results are given in Table 12 using recall
and precision, and in Table 13 at various cutoffs based on the number of documents
retrieved.
Recall  10%    20%    30%    40%    50%    60%    70%    80%    90%    100%   Av.
Vd      0.324  0.196  0.136  0.090  0.083  0.034  0.022  0.005  0.000  0.000  0.089
Vn      0.325  0.218  0.128  0.084  0.080  0.034  0.022  0.005  0.000  0.000  0.090
Va      0.298  0.170  0.116  0.074  0.067  0.040  0.026  0.005  0.000  0.000  0.080
V1      0.372  0.235  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099
V2      0.372  0.236  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099
Table 12: Ranking all data
Documents  5      15     30     100    200    Av.
Vd         0.417  0.431  0.404  0.349  0.294  0.379
Vn         0.370  0.408  0.407  0.353  0.301  0.368
Va         0.383  0.391  0.379  0.329  0.279  0.352
V1         0.447  0.472  0.438  0.377  0.312  0.409
V2         0.447  0.475  0.438  0.375  0.312  0.409
Table 13: Ranking all data
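
The section does not spell out how the figures in Tables 12 and 13 were computed. As a rough illustration of the two kinds of measurement, the following minimal sketch computes precision at fixed document cutoffs and precision at fixed recall levels for a single query. The function names, the use of interpolated precision (the best precision at any rank where recall has reached the level), and the assumption that a ranking contains no duplicate documents are choices made for this example, not details taken from the paper.

```python
from typing import Dict, Sequence, Set


def precision_at_cutoffs(ranking: Sequence[str], relevant: Set[str],
                         cutoffs: Sequence[int] = (5, 15, 30, 100, 200)) -> Dict[int, float]:
    """Fraction of the top-k ranked documents that are relevant, for each k."""
    result = {}
    for k in cutoffs:
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        result[k] = hits / k
    return result


def precision_at_recall_levels(ranking: Sequence[str], relevant: Set[str],
                               levels: Sequence[int] = tuple(range(10, 101, 10))
                               ) -> Dict[int, float]:
    """Interpolated precision at each recall level (levels given in percent).

    Assumption for this sketch: a level that the ranking never reaches
    scores 0.0.
    """
    total = len(relevant)
    points = []  # (relevant documents seen so far, precision at that rank)
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits, hits / rank))
    result = {}
    for level in levels:
        # Integer comparison avoids floating-point trouble at the boundaries:
        # recall >= level%  is the same as  hits * 100 >= level * total.
        reached = [p for h, p in points if h * 100 >= level * total]
        result[level] = max(reached, default=0.0)
    return result


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5"]   # hypothetical ranked output
    relevant = {"d1", "d3", "d9"}              # hypothetical judgments
    print(precision_at_cutoffs(ranking, relevant, cutoffs=(2, 5)))
    print(precision_at_recall_levels(ranking, relevant))
```

Averaging the per-query values over all queries would give one row of a table of this shape. Under the definition used here, a truncated ranking that never retrieves enough relevant documents to reach a high recall level contributes zero at that level.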
The results we obtained for the smaller collection are again observed with the full 2 Gb of
text. However, the advantage of using the structure of the queries is slightly more pronounced.
4 Summary
In the first line of investigation, we built a compressed retrieval system for around 2 Gb of text
that required under 37% of the size of the input text, and was created in just over 15 CPU-hours
on a typical well-configured workstation. Retrieval performance for simple techniques such as
the cosine measure is fast, and queries can be processed effectively on low-end workstations in
seconds. The techniques developed during this first thread of investigation have thus altered
the status of 2 Gb information retrieval systems from being 'unpleasantly expensive' to being
'eminently practical'.
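
The cosine measure is named above without its formula being restated in this section. As a reminder of the general idea, here is a minimal in-memory sketch that ranks documents by the cosine of the angle between weighted query and document vectors. The log-tf times idf weighting, whitespace tokenization, and the name `cosine_rank` are assumptions made for illustration; they are not the authors' implementation, which runs against a compressed database.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine_rank(query: str, docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query, best first."""
    n = len(docs)
    doc_tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tf in doc_tf.values():
        df.update(tf.keys())

    def weights(tf: Counter) -> Dict[str, float]:
        # Assumed weighting: (1 + ln tf) * ln(1 + N / df).
        # Terms that occur in no document are dropped.
        return {t: (1.0 + math.log(f)) * math.log(1.0 + n / df[t])
                for t, f in tf.items() if df[t] > 0}

    def norm(w: Dict[str, float]) -> float:
        return math.sqrt(sum(x * x for x in w.values()))

    q_w = weights(Counter(query.lower().split()))
    q_norm = norm(q_w)

    scored = []
    for doc_id, tf in doc_tf.items():
        d_w = weights(tf)
        d_norm = norm(d_w)
        dot = sum(w * d_w[t] for t, w in q_w.items() if t in d_w)
        score = dot / (q_norm * d_norm) if q_norm > 0 and d_norm > 0 else 0.0
        scored.append((doc_id, score))

    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    docs = {  # hypothetical toy collection
        "a": "compressed index retrieval",
        "b": "boolean query structure",
        "c": "compressed text compressed index",
    }
    for doc_id, score in cosine_rank("compressed index", docs):
        print(doc_id, round(score, 3))
```

A system working at the 2 Gb scale would not scan every document as this sketch does; it would typically walk inverted lists for the query terms only and keep precomputed document lengths.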
In our second thread of research we have seen that the use of the structure of queries has
enabled Boolean queries to be generated that have few disjuncts and perform better than more
complicated ones. Given the large number of documents that are relevant to many of the
queries that have been examined, such algorithms must be subject to reduced performance