NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

A. Kent, A. Moffat, R. Sacks-Davis, R. Wilkinson, J. Zobel
There are two observations that one might make after examining both forms of evaluation.
The first is that treating the text of the query in a structured way, using the methodology
proposed here, yields a small gain of less than 10%. The second is both surprising and has
significant ramifications: in previous experiments on small collections with relatively short
queries, the use of adjacent pairs provided substantial improvement in recall/precision [3, 12],
yet we see no such improvement in these experiments.
Finally, we may compare the three Boolean algorithms in their ability to identify appropriate
subsets for ranking. In this comparison, we use V4 for ranking each of the subsets. Let Ac
represent the algorithm based on concepts, Ad the algorithm based on the description, and
Ap the algorithm that uses adjacent pairs. Table 10 shows recall/precision figures and
Table 11 shows precision for a fixed number of documents returned.
Recall  10%    20%    30%    40%    50%    60%    70%    80%    90%    100%   Av.
Ac      0.185  0.089  0.054  0.031  0.028  0.012  0.000  0.000  0.000  0.000  0.040
Ad      0.220  0.133  0.048  0.017  0.018  0.017  0.000  0.000  0.000  0.000  0.045
Ap      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 10: Comparison of Boolean algorithms
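The figures in Table 10 are interpolated precision values at fixed recall levels, averaged in the final column. A minimal sketch of that standard computation (the function name and argument conventions are ours, not from the paper):

```python
def interpolated_precision(ranking, relevant):
    """Interpolated precision at recall levels 10%..100%, plus their average.

    ranking:  list of document ids in ranked order.
    relevant: collection of relevant document ids for the query.
    """
    relevant = set(relevant)
    points = []  # (recall, precision) at each relevant document retrieved
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Interpolated precision at recall r is the maximum precision
    # achieved at any recall level >= r.
    levels = [n / 10 for n in range(1, 11)]
    interp = [max((p for r, p in points if r >= level), default=0.0)
              for level in levels]
    return interp, sum(interp) / len(interp)
```

For example, with two relevant documents ranked first and third, precision is 1.0 up to 50% recall and 2/3 beyond it, giving an average of about 0.833.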
Documents  5      15     30     100    200    Av.
Ac         0.328  0.319  0.306  0.169  —      0.269
Ad         0.349  0.332  0.311  0.222  0.149  —
Ap         0.349  0.305  0.304  0.230  0.171  0.272

Table 11: Comparison of Boolean algorithms for a fixed number of documents returned
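Table 11 reports precision after a fixed number of documents has been retrieved. That measure can be sketched as follows (function and parameter names are ours; the cutoff values are those of Table 11):

```python
def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 15, 30, 100, 200)):
    """Precision after n documents retrieved, for each cutoff n.

    Precision at n is the number of relevant documents in the top n
    of the ranking, divided by n.
    """
    relevant = set(relevant)
    return {n: sum(1 for doc in ranking[:n] if doc in relevant) / n
            for n in cutoffs}
```

Unlike the interpolated recall/precision averages, this measure does not require knowing the full relevant set's size, only relevance judgements for the retrieved documents.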
Under both methods of evaluation, the algorithm that uses key descriptors without adjacent
pairs (Ad) outperformed both the pair-augmented algorithm (Ap) and the algorithm using
concepts (Ac). This is despite the fact that Ac returned an average of 1190 documents per
query and Ap an average of 667, while Ad returned an average of only 530 documents.
3.3 The Atlas System

The Boolean query generation experiments were performed using the Atlas database system [5].
Atlas is a nested relational database system designed for text applications. The system
was not adapted or tuned in any way to support the TREC experiments. Atlas supports a
number of signature file organisations: bit-slice, two-level schemes, and multi-organisation
schemes. We implemented an algorithm that analyses the data and selects the most
appropriate scheme and optimal parameters for that scheme [11]. For the TREC collection,
the multi-organisation scheme was selected.
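As a general illustration of the bit-slice organisation mentioned above (this is a toy sketch, not the Atlas implementation; the signature width, hash count, and all names are our own assumptions):

```python
import hashlib

SIG_BITS = 64  # signature width in bits (illustrative; real systems tune this)
HASHES = 3     # number of bit positions set per term

def term_bits(term):
    # Derive up to HASHES bit positions from a hash of the term.
    digest = hashlib.md5(term.encode()).digest()
    return {digest[i] % SIG_BITS for i in range(HASHES)}

def build_slices(docs):
    """Bit-sliced organisation: slices[b] is a bitmap over documents,
    with bit d set when document d's signature has bit b set.
    Storing the signature matrix by column means a query touches only
    the few slices named by its terms' bit positions."""
    slices = [0] * SIG_BITS
    for d, text in enumerate(docs):
        for term in text.split():
            for b in term_bits(term):
                slices[b] |= 1 << d
    return slices

def candidates(slices, query_terms, ndocs):
    """AND together the slices for the query terms' bit positions.
    Every true match survives; hash collisions may admit false drops,
    which a real system removes by checking the documents themselves."""
    mask = (1 << ndocs) - 1
    for term in query_terms:
        for b in term_bits(term):
            mask &= slices[b]
    return [d for d in range(ndocs) if mask >> d & 1]
```

The two-level and multi-organisation schemes the paper refers to layer further structure over this idea, e.g. coarser block-level signatures consulted before the per-document slices.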
It took 23 hours to load the database and build the index, which required 313 Mb of disk
space. This indexed all stopped and stemmed terms, including adjacent pairs. The text stored
in the database was compressed using the mechanism described in Section 2.3.