NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
cutoffs could be different for synonymy and specialization. For example, the following were obtained from the TREC WSJ training database:

ICW(child) = 0.000001
ICW(baby) = 0.000013
ICW(infant) = 0.000055
SIM(child, infant) = 0.131381
SIM(baby, child) = 0.183064
SIM(baby, infant) = 0.323121
Therefore both baby and infant can be used to specialize child (with θ = 10), while baby and infant can be considered synonyms (with θ = 5). Note that if θ is well chosen, then the above filter will also help to reject antonymous and complementary relations, such as SIMnorm(back, front) = 0.305287 with ICW(back) = 0.000001 and ICW(front) = 0.000006.
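The filter just described can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the authors' implementation: the function name, the similarity cutoff, and the use of an ICW ratio as the specificity test are assumptions based on the example values given in the text.

```python
# Hypothetical sketch of the specialization/synonymy filter described above.
# Thresholds and the ICW-ratio test are assumptions, not the paper's code.

def classify_pair(icw, sim_value, x, y, sim_cutoff=0.1,
                  theta_syn=5, theta_spec=10):
    """Classify a term pair (x, y) using the ratio of their ICW
    (term specificity) weights, once their similarity passes a cutoff."""
    if sim_value < sim_cutoff:
        return "rejected"                     # not similar enough to consider
    lo, hi = sorted((icw[x], icw[y]))
    ratio = hi / lo
    if ratio <= theta_syn:
        return "synonyms"                     # comparable specificity
    if ratio >= theta_spec:
        narrow = x if icw[x] > icw[y] else y  # higher ICW = more specific term
        broad = y if narrow is x else x
        return narrow + " specializes " + broad
    return "rejected"                         # ambiguous specificity ratio

# Example values from the text:
icw = {"child": 0.000001, "baby": 0.000013, "infant": 0.000055}

print(classify_pair(icw, 0.323121, "baby", "infant"))   # ratio ~4.2 -> synonyms
print(classify_pair(icw, 0.131381, "child", "infant"))  # ratio 55 -> specialization
```

With these values the baby/infant ICW ratio is about 4.2 (within θ = 5, hence synonyms), while the child/infant ratio is 55 (beyond θ = 10, hence specialization), matching the discussion above.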
We continue working to develop more effective filters. Examples of filtered similarity relations obtained from the TREC corpus are listed in Tables 2 and 3.
SUMMARY OF RESULTS
We have processed a total of 500 MBytes of articles from the Wall Street Journal section of the TREC database. Runs were made independently on the training corpus (250 MBytes) and the test corpus (another 250 MBytes). Natural language processing of each half of the corpus, including tagging, stemming, parsing and pair extraction, required approximately 2 weeks of nearly uninterrupted processing on two Sparc workstations (we used an assortment of SparcStation 1, ELC and 2, with MIPS rates varying from 13 to 21 to 28.5).22 Computing term similarities from the first 250 MBytes of text took up another week. This time, limitations in our computing resources, mostly memory, were the main reason. With a sufficient amount of RAM (on the order of 256 MBytes or more) we estimated that this process should take no more than 20 hours. It should be noted that we computed term similarities based only on the training corpus, but we used them in retrieval from either database. We assumed that the underlying set of concepts and the terms describing them does not vary greatly between the two databases (the first covered the Wall Street Journal from 1987 to 1989, the second from 1990 to 1992).
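The processing stages named above (tagging, stemming, parsing, pair extraction) form a simple pipeline. The sketch below is purely schematic: every stage is a toy stand-in, not the actual tagger, stemmer, or parser used in these experiments.

```python
# Schematic corpus-processing pipeline; each stage is a toy stand-in for the
# real component (the actual system used a full tagger, stemmer, and parser).

def tag(tokens):
    # part-of-speech tagging (stand-in: label everything a noun)
    return [(t, "NN") for t in tokens]

def stem(tagged):
    # suffix stripping (stand-in: drop a trailing "s")
    return [(w[:-1] if w.endswith("s") else w, pos) for w, pos in tagged]

def extract_pairs(stemmed):
    # head-modifier pair extraction (stand-in: adjacent-word pairs;
    # the real system extracted pairs from parse structures)
    words = [w for w, _ in stemmed]
    return list(zip(words, words[1:]))

def process(document):
    tokens = document.lower().split()
    return extract_pairs(stem(tag(tokens)))

print(process("Stock prices rose sharply"))
# -> [('stock', 'price'), ('price', 'rose'), ('rose', 'sharply')]
```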
22 Processing of the training corpus took significantly longer in practice (real time) since we had only one SparcStation available at any given time, and also due to errors made and the resulting need for re-processing some parts.
The subsequent indexing process performed by the NIST IR system required an additional 2 weeks for each half of the database, on a single SparcStation 2.23 In total, we created three inverted file indexes: two were derived from the training corpus, and one from the test corpus. The first index created from the training corpus did not include compound terms; the remaining two indexes included them. No cumulative index was produced, since we estimated it would require more than 6 weeks to build. Some interesting discrepancies were observed. While the postings file generated from the training corpus was 25% larger than the one generated from the test corpus, the corresponding dictionary of terms was about 25% smaller in the training corpus. This seems to suggest that in the 1990-92 volumes of the Wall Street Journal (test corpus) a greater number of unique terms were used, while these terms on average occurred more sparsely in the text. This fact may partially undermine an earlier assumption that term similarities generated from the training corpus were adequate for the test corpus as well.
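The dictionary-size versus postings-size comparison just described can be computed with a few lines of code. The sketch below uses toy document lists purely to show the two statistics side by side; the actual comparison was of course over the two 250-MByte WSJ collections.

```python
# Illustrative dictionary-size vs. postings-count comparison (toy corpora).

def index_stats(docs):
    """Return (dictionary size, postings count) for a document list,
    counting one posting per distinct (term, document) pair."""
    dictionary = set()
    postings = 0
    for doc in docs:
        terms = set(doc.lower().split())
        dictionary |= terms
        postings += len(terms)
    return len(dictionary), postings

# Toy stand-ins for the two corpora (illustrative only):
train = ["oil prices fell", "oil prices rose", "prices fell again"]
test = ["grain exports grew", "steel imports fell", "bond yields rose"]

print(index_stats(train))  # smaller dictionary, terms reused -> (5, 9)
print(index_stats(test))   # larger dictionary, sparser terms -> (9, 9)
```

A corpus with a larger dictionary but a smaller postings file is exactly the pattern observed above: more unique terms, each occurring in fewer documents.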
During the training stage of TREC we performed only very limited tests, mostly because the base system retrieval was quite slow on a database of this size (approx. 173 MBytes, 99+K documents).24 We ran experiments with topics 001 to 005, using relevance judgements supplied with the training database. The purpose of these tests was to (1) observe any improvement in performance resulting from the use of compound phrase terms, (2) tune the term similarity filter by adjusting certain threshold values in an attempt to obtain the most effective set of similarities, and (3) note if any manual intervention into the queries could improve retrieval results. As may be expected, none of these experiments were conclusive, but this was all we could rely upon in the short time given. First of all, we noticed that the system's overall performance (cumulative recall and precision levels) was below what we could have expected from runs on smaller benchmark collections such as CACM-3204. This may be partially explained by the diversity of the subject matter covered in the TREC data, and also by the character of the queries (i.e., requests to extract information), but some problem with the base search engine may also be to blame.25 We were
23 The indexing process could not be done in parallel because the NIST system requires serial processing. Moreover, as indexing progressed and the partial database grew in size, the process slowed down significantly, from approx. 6 min/file to 37 min/file.
24 The system needed approx. 60 minutes to process a query.
25 In the middle of the training run we discovered that the NIST system had been designed to handle databases of up to 65K documents (2^16). Subsequently we rewrote portions of the code to fix this, but we were not certain if other decisions made by the retrieval module (e.g., bitmapping, idf cutoffs) were not in fact coun-