SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Single Language Evaluation of a Multi-lingual Text Retrieval System
chapter
T. Dunning
M. Davis
National Institute of Standards and Technology
Donna K. Harman
-3-
in subseqent sorting. This step is also where word and document tables are created which allow
bi-directional conversion between strings and word numbers, or file references and document
numbers. The output of the relabelling consists of three files,
1) the string table, in which all strings (including file names) are kept.
2)
the document table, in which the file names, starting offset and lengths of all documents is
kepL
3) a vector of sortable posting structures.
In order to improve performance in cases where a large amount of data must be indexed.
the first two passes are often integrated. This has several benefits which primarily include a rad-
ical decrease in the number of times the word and document tables must be reread, and the sub-
stitution of flinction call/return instead of Unix pipes for the transfer of data between the tokeni-
zation and relabeling passes.
3.3. Sorting
Next, these sortable posting are sorted using a version of MSB radix sort combined with a
heap merge pass for very large files. Our radix sort uses a radix of 65536 and is able to sort at
a rate of nearly 1MB per second. Total time for the radix sort is linear in the number of ele-
ments to be sorted, up to the limit of main memory. Merging then takes e(n logm) where m is
the number of internal sorts needed. In practice, sorting is so fast that tokenization and relabel-
mg are the bottleneck even when these first two passes are integrated.
The major disadvantage of this technology is that a considerable amount of temporary disk
space is required. Ultimately, at least 100% overhead in temporary disk space is required.
Furthermore, this overhead is needed even if all that is being done is to update the indices. We
believe that paying a time penalty to avoid this overhead is probably very much warranted if a
system is to be used on large files.
3.4. Stripping
The output of the sort phase consists of a very large vector of postings. In this vector,
there are long runs in which the word number is constant. The stripping process consists of
creating and index so that the boundaries of these long runs can be found, and deleting the word
number from the postings. This change in structure results in about a 25-50% decrease in post-
ing file size (depending on how much positional information is kept), and the index allows very
fast access to the vector of postings for any particular word.
195