SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Retrieval of Partial Documents
chapter
A. Moffat
R. Sacks-Davis
R. Wilkinson
J. Zobel
National Institute of Standards and Technology
D. K. Harman
            Documents returned
Experiment      5     10     15     20     25     30     50    200
1           0.286  0.271  0.248  0.236  0.234  0.229  0.206  0.102
2           0.243  0.221  0.214  0.204  0.191  0.178  0.170  0.092
3           0.271  0.250  0.229  0.221  0.206  0.202  0.184  0.094
4           0.329  0.257  [illegible]  0.229  0.206  0.202  0.184  0.085
5           0.343  0.264  0.238  0.236  0.234  0.231  0.209  0.099
Table 2: Comparison of ranking formula for fixed number of documents returned
            Sections returned
Experiment    [?]    [?]    [?]     20    [?]    [?]     50    200
[?]         0.186  0.164  0.181  0.161  0.160  0.152  0.140  0.120
[?]         0.100  0.121  0.105  0.100  0.094  0.088  0.090  0.083
[?]         0.171  0.121  0.114  0.100  0.091  0.083  0.100  0.082
[?]         0.214  0.171  0.186  0.189  0.189  0.193  0.179     NA
Table 3: Comparison of ranking formula for fixed number of sections returned
document ranking was perhaps surprising: other
studies have indicated that considering smaller
fragments was helpful, although in combination
with larger contexts. We wondered whether the
section boundaries that had been imposed were
inappropriate. We therefore applied the
pagination techniques described in Section 2.1
to the 4,000-document database used here. The
results are shown in Table 4 as Experiment 10.
These three experiments are a little difficult
to interpret. Dividing documents into sections
leads to poorer retrieval performance, but
dividing further into pages leads to retrieval
performance comparable to that of ranking whole
documents. It may be that the manually supplied
divisions are poorer than the divisions
generated by automatic techniques [5].
Experiments 2-4 show that some of this
performance degradation can be ameliorated by
taking document structure into account.
However, these experiments indicate that there
is no retrieval advantage in breaking documents
up when the desired unit of retrieval is the
whole document.
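
As an illustration only, the following Python
sketch shows one way a document might be divided
into pages and page scores mapped back to whole
documents. The 1,000-byte page target, the
function names, and the rule of scoring a
document by its best page are assumptions made
for this example; they are not taken from the
pagination procedure of Section 2.1.

```python
def paginate(paragraphs, target_bytes=1000):
    """Group consecutive paragraphs into pages of roughly target_bytes.

    target_bytes is a hypothetical setting, not a value from the paper.
    """
    pages, current, size = [], [], 0
    for para in paragraphs:
        current.append(para)
        size += len(para.encode("utf-8"))
        if size >= target_bytes:
            # Close the page once it has reached the target size.
            pages.append("\n".join(current))
            current, size = [], 0
    if current:
        # Any trailing paragraphs form a final, shorter page.
        pages.append("\n".join(current))
    return pages


def document_scores_from_pages(page_scores):
    """Collapse (doc_id, page_score) pairs to one score per document.

    Assumption: a document is represented by its best-scoring page.
    """
    best = {}
    for doc_id, score in page_scores:
        if score > best.get(doc_id, float("-inf")):
            best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Other aggregation rules, such as summing the
scores of the top few pages, would fit the same
interface.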
4 Conclusions
The combination of a restricted-accumulators
policy and the introduction of skips to the
compressed inverted file entries allows fast
query evaluation on large text collections.
Moreover, the ranking can be carried out within
modest amounts of main memory. For example, the
paged TREC collection contains 1.7 million pages,
but ranked queries of 50 or more terms can be
resolved within seconds using just a few
megabytes of main memory. These two techniques
mean that large collections can be searched on
small machines without measurable degradation in
retrieval effectiveness.
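
The interaction between the two techniques can
be sketched in Python as follows. This is a
minimal illustration under stated assumptions,
not the implementation used in the experiments:
the index is an uncompressed in-memory
dictionary, the term and document weights are a
simple log formula without length normalisation,
and SKIP_INTERVAL, add_skips and rank are
hypothetical names. Once the accumulator limit
is reached, later terms only update existing
accumulators, and the skip entries are used to
jump over blocks of postings that cannot contain
a candidate document.

```python
import math
from bisect import bisect_right

SKIP_INTERVAL = 4  # hypothetical: one skip entry per 4 postings


def add_skips(postings):
    """Pair a sorted [(doc_id, tf), ...] list with each block's first doc_id."""
    skip_docs = [postings[i][0] for i in range(0, len(postings), SKIP_INTERVAL)]
    return postings, skip_docs


def rank(query_terms, index, num_docs, max_accumulators, top_k=10):
    """Rank documents using at most max_accumulators score accumulators.

    index maps term -> (postings, skip_docs) as built by add_skips.
    Postings lists are assumed to be non-empty and sorted by doc_id.
    """
    accumulators = {}
    # Process the rarest terms first so that they select the candidates.
    terms = sorted({t for t in query_terms if t in index},
                   key=lambda t: (len(index[t][0]), t))
    for term in terms:
        postings, skip_docs = index[term]
        weight = math.log(1 + num_docs / len(postings))
        if len(accumulators) < max_accumulators:
            # Room remains: scan the whole list, creating accumulators
            # until the limit is reached.
            for doc_id, tf in postings:
                if doc_id in accumulators or len(accumulators) < max_accumulators:
                    accumulators[doc_id] = (accumulators.get(doc_id, 0.0)
                                            + math.log(1 + tf) * weight)
        else:
            # Limit reached: visit only the existing candidates, using the
            # skip entries to jump to the block that may hold each one.
            pos = 0
            for doc_id in sorted(accumulators):
                block = bisect_right(skip_docs, doc_id) - 1
                pos = max(pos, block * SKIP_INTERVAL)
                while pos < len(postings) and postings[pos][0] < doc_id:
                    pos += 1
                if pos == len(postings):
                    break
                if postings[pos][0] == doc_id:
                    accumulators[doc_id] += math.log(1 + postings[pos][1]) * weight
    ranked = sorted(accumulators.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

In a compressed index the benefit of a skip is
that the postings in a bypassed block need not
be decoded; the in-memory lists here only mimic
the access pattern.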
In the second part of the experiments we
concentrated on large documents, breaking them
into smaller units for the purposes of indexing.
It is not clear that users are interested in
retrieving 3 MB documents, and these experiments
were designed to allow users the option of
retrieving smaller parts of such documents. The
results were mixed. It appears that indexing
both sections and documents is helpful in
ranking sections. However, it is not clear what
an appropriate indexing strategy is if only full
documents are to be returned. We are continuing
our investigation of partial document retrieval.
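
One simple way to use both levels of index when
ranking sections is to blend each section's own
score with the score of its parent document. The
Python sketch below assumes a linear
interpolation with a hypothetical section_weight
parameter; the combination formulas compared in
the experiments are not reproduced here.

```python
def rank_sections(section_scores, document_scores, section_weight=0.5):
    """Rank sections by a blend of section and parent-document scores.

    section_scores:  {(doc_id, section_no): score}
    document_scores: {doc_id: score}
    section_weight is a hypothetical mixing parameter in [0, 1].
    """
    combined = {}
    for (doc_id, section_no), score in section_scores.items():
        # A section with no scored parent falls back to its own evidence.
        doc_score = document_scores.get(doc_id, 0.0)
        combined[(doc_id, section_no)] = (section_weight * score
                                          + (1.0 - section_weight) * doc_score)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

With section_weight set to 1.0 the function
reduces to ranking on section scores alone,
which makes the effect of the document-level
evidence easy to isolate.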
Acknowledgements
We would like to thank Daniel Lam and Neil
Sharman for their assistance with various
components of the experiments described here.
This work was supported by the Australian
Research Council, the Collaborative Information
Technology Research Institute, and the Centre
for Intelligent Decision Systems.