NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
T. Strzalkowski, J. Carballo
National Institute of Standards and Technology, D. K. Harman

CREATING AN INDEX

Most of the problems we encountered when adapting NIST's PRISE system for use in TREC-2 had to do with the size of the data that had to be indexed. We had to work within the restrictions imposed by the resources we had (e.g., only a limited amount of virtual memory). The rest of this section describes some of the changes we made to the NIST system in order to deal with these restrictions.

The original system would request twice the previously requested amount of memory each time it needed more. As a result, the system would reach the limit of virtual memory after only a relatively small portion of the total number of documents had been indexed. In our version, the memory requested by the system grows linearly. The increments are estimated in such a way that the system never requests too much memory.

The indexing process also became fragile when the limits of the environment were approached. When a large portion of the virtual memory and of the disk space was being used by the indexing process, crashes became very likely. Unfortunately, it turned out that the process was very difficult to restart after some crashes (e.g., in the rebuild phase), leading to time-consuming repeats.

Indexing also takes too long at present. Given the size of the data to be indexed, the whole process takes at least 250 hours if everything goes well, which seldom happens. Given TREC-2's deadlines we could not afford to perform many experiments: we barely had time to index the corpus once.

Most of these problems could be solved by distributing the indexing process to several different machines and performing the indexing in parallel.
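The difference between the two allocation policies can be sketched as follows. This is an illustrative model only: the actual PRISE allocator is not shown in the text, and the function names and the fixed increment are assumptions made for the example.

```python
def grow_doubling(current_size, needed):
    """Original strategy: double the previous request until the need is met.
    Requests quickly overshoot, exhausting virtual memory early."""
    size = max(current_size, 1)
    while size < needed:
        size *= 2  # each request is twice the previous one
    return size

def grow_linear(current_size, needed, increment):
    """Revised strategy: grow by a fixed, pre-estimated increment,
    so the system never requests much more memory than it will use."""
    size = current_size
    while size < needed:
        size += increment
    return size

# With 64 MB in hand and 100 MB needed, doubling jumps to 128 MB,
# while linear growth with a 16 MB increment stops at 112 MB.
```

The point of the linear policy is that the worst-case overshoot is bounded by one increment, rather than by the current allocation size.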
We believe that it is possible to create several small indexes instead of a single very large one. If certain rules are followed when creating the distributed index, it should be possible to merge the results of querying the set of small indexes and to obtain performance (recall and precision) comparable to the results obtained using a single index. The test setup we built in order to perform the experiments required for TREC-2 should allow us to test these hypotheses.

The advantages of a distributed index are clear:
(1) The indexing process would be faster.
(2) Each of the distributed indexing processes would be smaller and less fragile.
(3) Even if one of the distributed processes crashed, restarting it would be less expensive.
(4) A distributed system would be much easier to update, i.e., adding a new document would not require reindexing the whole corpus.
(5) A distributed system would be more useful for studying the kinds of problems and solutions that are likely to be encountered in a real-world situation.

SUMMARY OF RESULTS

We processed a total of 850 MBytes of text during TREC-2. The first 550 MBytes were articles from the Wall Street Journal which were previously processed for TREC-1; we had to repeat most of the processing to correct early tokenization errors introduced by the tagger. The entire process (tagging, parsing, phrase extraction) took just over 4 weeks on two Sun SparcStations (models 1 and 2). Building a comprehensive index for the WSJ database took up another 2 weeks. This time we were able to create a single index thanks to the improved indexing software received from NIST. The final index size was 204 MBytes, and it included 2,274,775 unique terms (of which about 310,000 were single-word terms, and the remaining 1,865,000 were syntactic word pairs) occurring in 173,219 documents, or more than 13 unique terms per document.
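Merging the ranked results of querying several small indexes can be sketched as below. This is a minimal illustration, not the PRISE implementation; it assumes each sub-index returns a list of (score, document id) pairs sorted by descending score, and that scores from different sub-indexes are directly comparable, which in practice requires shared or normalized collection statistics.

```python
import heapq
import itertools

def merge_results(result_lists, k=10):
    """Merge ranked result lists from several sub-indexes into a
    single top-k list. Each input list holds (score, doc_id) pairs
    already sorted by descending score, so a k-way heap merge
    produces the global ranking without re-sorting everything."""
    merged = heapq.merge(*result_lists, key=lambda pair: -pair[0])
    return list(itertools.islice(merged, k))

# Example: two sub-indexes each return a small ranked list.
wsj_hits = [(0.9, "d1"), (0.5, "d3")]
sjmn_hits = [(0.8, "d2"), (0.4, "d4")]
top = merge_results([wsj_hits, sjmn_hits], k=3)
```

If the sub-indexes compute term weights from their own local collection statistics, the raw scores are not comparable and a normalization step would be needed before merging; the "certain rules" mentioned above would have to guarantee comparability.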
Note that this gives potentially much better coverage of document content than single-word terms alone, which yield fewer than 2 unique terms per document. We say 'potentially' since the process of deriving phrase-level terms from text is still only partially understood, including the complex problem of 'normalization' of representation.

The remaining 300 MBytes were articles from the San Jose Mercury News, which were contained on TIPSTER disk 3. Processing of this part, and creating an index for routing purposes, took about 3 weeks. While natural language processing required 2 weeks to complete (at approximately the same speed as for the WSJ database), we were able to cut indexing time in half by using a faster in-memory version of the NIST system. This new version reduces the time required by the first phase of indexing from days to hours; however, the second phase remains slow (days) and fragile (we had to redo it 3 times). The final size of the SJMN index was 101 MBytes, with 1,535,971 unique terms occurring in 86,270 documents (nearly 18 unique terms per document).[13]

Two types of retrieval were done: (1) new topics 101-150 were run in ad-hoc mode against the WSJ database, and (2) topics 51-100, previously used in TREC-1, were run in routing mode against the SJMN database. In each category several runs were attempted.

[13] It has to be noted that the ratios at which new terms are generated are nearly identical in both databases; at 86,319 documents (or about halfway through the WSJ database) 1,335,622 unique terms had been recorded.