SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Appendix B: System Features Appendix
National Institute of Standards and Technology
D. K. Harman

CONSTRUCTION OF INDICES, KNOWLEDGE BASES AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES

605.1 MB for compressed text
27.2 MB for auxiliary structures
Complete retrieval system, including index, occupies 40% of space required by original unindexed text.
4 hours for compression, plus 4 hours for indexing; 8 hours total build time.

Combination of signatures and non-inverted document descriptions.

                                            1. Experiment:      2. Experiment:
                                            topics 51-100       topics 101-150
                                            versus disk 3       versus disks 1 and 2
signatures:                                 169 MB              374 MB
non-inverted document descriptions:         278 MB              618 MB
normalized inverse document frequencies:    2.1 MB              2.1 MB
document lengths:                           OA MB               OA MB
mapping of features to numbers:             4.1 MB              4.1 MB

Uncompressing and indexing: ca. 21.5 h CPU (all collections of all 3 disks)
Loading descriptions into access structure: 10 msec/document

[6] For each feature occurring in a document description, a bit is set in the signature of the document by applying a hash function to the feature number. The signatures are used to determine an approximate RSVO. The documents are ranked according to these RSVO's. Beginning at the top of the ranked list, exact RSV's are computed using the non-inverted document descriptions. It is not necessary to compute all exact RSV's, because documents can be provided to the user as soon as their exact RSV is bigger than the RSVO of the actually regarded document.

[7] For TRW2 (statistical queries), we built a combined word frequency table, phrase frequency table (2 and 3 word phrases), and a special features frequency table.
These were based on a selected subset of the training database and were used to calculate term weights. They had no direct role in the execution of the queries.
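
The two-stage ranking described in the note on signatures can be illustrated with a minimal Python sketch. It is not the participants' implementation. Several details are assumptions made only for illustration: the 64-bit signature width, the multiplicative hash in feature_bit, document weights restricted to the range [0, 1], and an exact RSV computed as a weighted sum over the query features. Under those assumptions the signature score is an upper bound on the exact RSV, which is what allows a document to be output before all exact RSVs have been computed.

```python
import heapq

SIG_BITS = 64  # assumed signature width; the source does not state it


def feature_bit(feature: int) -> int:
    """Hypothetical hash mapping a feature number to a bit position."""
    return (feature * 2654435761) % SIG_BITS


def make_signature(description: dict) -> int:
    """Set one bit per feature occurring in the document description."""
    sig = 0
    for feature in description:
        sig |= 1 << feature_bit(feature)
    return sig


def approx_rsv(sig: int, query: dict) -> float:
    """Upper bound on the exact RSV, assuming document weights in [0, 1].

    A set bit may be a false positive, but a feature that is present
    always has its bit set, so the bound never underestimates.
    """
    return sum(w for f, w in query.items() if (sig >> feature_bit(f)) & 1)


def exact_rsv(description: dict, query: dict) -> float:
    """Exact RSV from the non-inverted description (assumed weighted sum)."""
    return sum(w * description.get(f, 0.0) for f, w in query.items())


def ranked_retrieval(docs: dict, query: dict, k: int) -> list:
    """Return the top-k (doc_id, exact RSV) pairs, best first."""
    approx = {d: approx_rsv(make_signature(desc), query) for d, desc in docs.items()}
    candidates = sorted(docs, key=lambda d: approx[d], reverse=True)

    results = []
    pending = []  # max-heap (negated scores) of documents with known exact RSV
    for doc_id in candidates:
        bound = approx[doc_id]  # no unprocessed document can score above this
        # Output every pending document whose exact RSV exceeds the bound.
        while pending and len(results) < k and -pending[0][0] > bound:
            neg_score, d = heapq.heappop(pending)
            results.append((d, -neg_score))
        if len(results) >= k:
            return results  # remaining exact RSVs are never computed
        heapq.heappush(pending, (-exact_rsv(docs[doc_id], query), doc_id))

    # Candidate list exhausted: flush what is left, best first.
    while pending and len(results) < k:
        neg_score, d = heapq.heappop(pending)
        results.append((d, -neg_score))
    return results


if __name__ == "__main__":
    docs = {
        1: {10: 0.9, 20: 0.5},
        2: {10: 0.2},
        3: {30: 0.7},
        4: {20: 0.6, 30: 0.1},
    }
    query = {10: 1.0, 20: 0.5}
    print([d for d, _ in ranked_retrieval(docs, query, k=2)])
```

In this toy run the exact RSV of document 3 is never computed, because two documents have already exceeded its signature bound by the time it is reached. With a real collection the saving depends on how tight the signature bound is, which in turn depends on the signature width and the hash function.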