SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Appendix B: System Features Appendix National Institute of Standards and Technology D. K. Harman

I. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES - STATISTICS ON DATA STRUCTURES (CON'T)

[...]d TF*IDF to pull out significant terms from documents, found significant term pairs using entropy-based statistic.

[...] first-order thesaurus (which is a collection of core terminology) is constructed for each topic. In the case of routing topics, the source for the thesaurus for e[...] of query-specific known relevant documents. In the case of ad-hoc queries, relevant documents are automatically identified via an initial retrieval (querying [...] entire collection.

[...] uses "reduced-dimensional" term and document vectors (see below for details of how they are constructed). For TREC-2, we used approximately 200 dimensi[...] each document has a real-valued vector in this 200-dimensional space. For the routing queries, we used an 88112 term x 68385 doc sample to construct an LSI-[...] [...]tors are embedded in this space. (At this point, all the documents used in the scaling could be removed. We did not do so.) The resulting data structure [...] (could be reduced to 70 MB if documents omitted.) For adhoc queries, we constructed an 82[?]8 term x 69997 doc LSI-space. We folded in the 672358 CD-12 [...] were not in this sample. The resulting data structure was 549 MB.

[...] MB for routing (could be reduced to 70 MB), 549 MB for adhoc.

Approximately 20 hours for the routing SVD and 20 hours for the adhoc SVD. In addition, 672K documents were added for the adhoc run, taking about 2 h[...] ([...] on a Sparc10 with 128 MB RAM or 384 MB RAM.)

[...] SVD analysis of document collections. Create a raw term-by-document matrix, and transform the cell entries by the appropriate weighting scheme. Used SMART pre-processing for this.
Calculate the best reduced k-dimensional approximation to this matrix using singular value decomposition (SVD). For the TREC-2 experiments, about [...] were used in the approximation - 199 dimensions for adhoc and 204 dimensions for the routing queries. Retrieval uses this 200-dimensional LSI-space [...] If necessary, fold in any terms or documents that are not in the original SVD analysis. Necessary for adhoc queries, not for routing queries.

Network node, edge files; routing using network node and edge files is straightforward.

Node file: 8x20; Edge file: 8[...] for 1 GB.

Routing:
1. Process 1 GB from Disk 1 (WSJ1, AP1, DOE, FR1, ZIF1).
2. Process queries against Disk 1 (training).
3. Process new Disk 3 as if they were queries -- to make use of Disk 1 statistics.
4. Combine queries, (old) dictionary and Disk 3 into network for retrieval.

5. Docnum file
6. Termnum (dictionary) file
7. Node file
8. Edge file
[...] Subdocument file
[...] Coded file (direct file)
[...] DOC ID checking file
[...] TERM ID checking file

[...]r 1 GB): 1GB 4. 1.1GB 5. 16 6. [...]or 1 GB): 3 6 7. 8x20 20 8. 8x5 7.5 -6. 40 ,8. 8x0.75=6

Yes, if sufficient RAM and disk space. For this experiment, no.

Two hours of manual labor.

Raw text --> Subdocument file.
Subdocument --> coded file, DOC ID file, TERM ID file, docnum file, termnum (dictionary) file.
Coded, termnum, docnum --> node, edge files.

Synonymous complex nominal listings; concept - relation - concept triples.

[...] MB (synonymous complex nominal listings); [...],198 MB - WSJ (concept - relation - concept triples)

Less than one hour for building complex nominal lists based on 50 topic statements.

[...] Special-purpose grammar which is written to extract complex nominals. A set of special handlers process tagged and bracketed text based on the knowledge base to extract concept - relation - concept triples.
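
The SVD procedure described earlier in this section (weight a raw term-by-document matrix, take a reduced k-dimensional SVD approximation, then fold in documents that were not part of the original analysis) can be illustrated with a small sketch. This is not the code used for the TREC-2 runs. It assumes NumPy, a made-up 5-term x 4-document count matrix, k = 2 rather than roughly 200, and a simple log(1 + count) cell weighting standing in for the unspecified "appropriate weighting scheme". The function names lsi_fit, fold_in_document, and cosine are hypothetical.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Rank-k truncated SVD of a weighted term-by-document matrix.

    Returns term vectors (terms x k), the k largest singular values,
    and document vectors (docs x k).
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T


def fold_in_document(doc_column, term_vectors, singular_values):
    """Place a new weighted term column in the existing k-dim space.

    Uses the standard fold-in projection d^T U_k S_k^{-1}; the SVD
    itself is not recomputed.
    """
    return doc_column @ term_vectors / singular_values


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy raw counts: rows are terms, columns are documents (invented data).
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

# Assumed weighting: log(1 + count). The source does not name its scheme.
weighted = np.log1p(counts)

term_vecs, sing_vals, doc_vecs = lsi_fit(weighted, k=2)

# Folding in a column that was part of the SVD recovers its document vector.
refolded = fold_in_document(weighted[:, 0], term_vecs, sing_vals)

# A document outside the original sample is folded in the same way.
new_counts = np.array([1, 0, 0, 2, 0], dtype=float)
new_vec = fold_in_document(np.log1p(new_counts), term_vecs, sing_vals)

# Similarity of the folded-in document to each original document.
similarities = [cosine(new_vec, d) for d in doc_vecs]
```

Fold-in only projects new columns onto the existing term vectors, so it is far cheaper than rerunning the SVD. This is consistent with the timings reported above, where the SVD runs took about 20 hours each while adding 672K documents took far less.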