SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Appendix B: System Features
Appendix
National Institute of Standards and Technology
D. K. Harman
CONSTRUCTION OF INDICES, KNOWLEDGE BASES AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CO]
NOTES:
605.1 MB for compressed text
27.2 MB for auxiliary structures
Complete retrieval system, including index, occupies 40% of space required by original unindexed text.
4 hours for compression, plus the 4 hours for indexing; 8 hours total build time.
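As a rough check of the space figures above, the following sketch adds the two reported sizes and derives the size of the original text implied by the 40% figure. It assumes that the 40% refers to the sum of the compressed text and the auxiliary structures, which the source does not state explicitly. The function names are illustrative only.

```python
# Reported figures from the system description above.
COMPRESSED_TEXT_MB = 605.1
AUXILIARY_MB = 27.2
SPACE_FRACTION = 0.40  # system size as a fraction of the original text


def total_system_mb(text_mb: float, aux_mb: float) -> float:
    """Size of the complete retrieval system (assumed: text plus auxiliary structures)."""
    return text_mb + aux_mb


def implied_original_mb(system_mb: float, fraction: float) -> float:
    """Original text size implied by the reported space fraction."""
    return system_mb / fraction


system = total_system_mb(COMPRESSED_TEXT_MB, AUXILIARY_MB)
original = implied_original_mb(system, SPACE_FRACTION)
print(f"system: {system:.1f} MB, implied original text: {original:.0f} MB")
```

Under that assumption the complete system occupies about 632 MB, which corresponds to roughly 1.58 GB of original text.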
Combination of signatures and non-inverted document descriptions.
1. Experiment: topics 51-100 versus disk 3.
   signatures:                               169 MB
   non-inverted document descriptions:       278 MB
   normalized inverse document frequencies:  2.1 MB
   document lengths:                         0.4 MB
   mapping of features to numbers:           4.1 MB
2. Experiment: topics 101-150 versus disks 1 and 2.
   signatures:                               374 MB
   non-inverted document descriptions:       618 MB
   normalized inverse document frequencies:  2.1 MB
   document lengths:                         0.4 MB
   mapping of features to numbers:           4.1 MB
Uncompressing and indexing: ca. 21.5h CPU (all collections of all 3 disks)
Loading descriptions into access structure: 10 msec./document
[6] For each feature occurring in a document description, a bit is set in the signature of the document by applying a hash function to the feature numb
signatures are used to determine an approximate RSVO. The documents are ranked according to these RSVO's. Beginning at the top of the ranked
exact RSV's are computed using the non-inverted document descriptions. It is not necessary to compute all exact RSV's because documents can ali
provided to the user as soon as their exact RSV is bigger than the RSVO of the actually regarded document.
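The two-stage ranking described in note [6] can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the exact RSV is the sum of non-negative query weights over the features present in a document, so the signature-based score (which can only over-count because of hash collisions) is an upper bound on the exact score. The signature width, the hash function, and all function names are hypothetical.

```python
import heapq

SIG_BITS = 64  # hypothetical signature width


def feature_bit(feature: int) -> int:
    """Map a feature number to a bit position (illustrative multiplicative hash)."""
    return (feature * 2654435761) % SIG_BITS


def make_signature(features) -> int:
    """Set one bit per feature occurring in the document description."""
    sig = 0
    for f in features:
        sig |= 1 << feature_bit(f)
    return sig


def approx_rsv(query: dict, sig: int) -> float:
    """Approximate RSV from the signature alone; never below the exact RSV."""
    return sum(w for f, w in query.items() if (sig >> feature_bit(f)) & 1)


def exact_rsv(query: dict, description: set) -> float:
    """Exact RSV from the non-inverted document description (assumed scoring)."""
    return sum(w for f, w in query.items() if f in description)


def retrieve(query: dict, descriptions: dict, top_k: int):
    """Return (ranked results, number of exact RSVs that had to be computed)."""
    sigs = {d: make_signature(fs) for d, fs in descriptions.items()}
    # Rank all documents by their approximate RSV, best first.
    ranked = sorted(((approx_rsv(query, s), d) for d, s in sigs.items()),
                    key=lambda t: -t[0])
    heap = []      # max-heap of documents with known exact RSV (negated scores)
    results = []
    n_exact = 0
    for bound, doc in ranked:
        # A scored document can be output once its exact RSV is at least the
        # approximate RSV of the document currently being regarded.
        while heap and -heap[0][0] >= bound and len(results) < top_k:
            score, d = heapq.heappop(heap)
            results.append((d, -score))
        if len(results) >= top_k:
            break
        heapq.heappush(heap, (-exact_rsv(query, descriptions[doc]), doc))
        n_exact += 1
    # Flush whatever is left if the ranked list was exhausted first.
    while heap and len(results) < top_k:
        score, d = heapq.heappop(heap)
        results.append((d, -score))
    return results, n_exact
```

Calling `retrieve(query, descriptions, top_k)` with a dictionary of query feature weights and a dictionary mapping document identifiers to feature sets returns the top documents together with the count of exact RSV computations, which is typically smaller than the collection size when only a few documents are requested.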
[7] For TRW2 (statistical queries), we built a combined word frequency table, phrase frequency table (2 and 3 word phrases), and a special features fi
table. These were based on a selected subset of the training database and were used to calculate term weights. They had no direct role in the exe
the queries.
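The frequency tables mentioned in note [7] could be built along the following lines. This is only a sketch: the source does not give the weighting formula, so the logarithmic inverse-frequency weight used here is an assumption, as are the whitespace tokenization and all function names. The special features table is not modeled.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_tables(training_docs):
    """Build a word frequency table and a 2- and 3-word phrase frequency table."""
    words, phrases = Counter(), Counter()
    for text in training_docs:
        tokens = text.lower().split()  # assumed tokenization
        words.update(tokens)
        for n in (2, 3):
            phrases.update(ngrams(tokens, n))
    return words, phrases


def term_weight(term, table):
    """Hypothetical weight: rarer terms in the training subset score higher."""
    count = table.get(term, 0)
    if count == 0:
        return 0.0  # unseen terms get no weight in this sketch
    return math.log(sum(table.values()) / count)
```

Consistent with the note, such tables would be consulted only when assigning weights to query terms, not while the queries are executed.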