Bayesian Inference with Node Aggregation for Information Retrieval
B. Del Favero and R. Fung

4 Experimental Environment

4.1 Hardware and Software Environment

The system was developed by one programmer over a period of six months. It was written in C using the Symantec Think C 5.0 compiler under Macintosh System 7.0.1. The experiments were run on two machines, depending on availability: a Macintosh IIfx and a Centris 650. Both machines had 8 MB of memory and a 500 MB hard disk.

4.2 Summary of System Data Structures

Table 4.1 lists the data structures used by the system. Each is listed with the sections of this paper in which it is described and with an indicator of whether it was provided by NIST (N), assessed manually by us (M), or generated automatically by the system (A).

Data Structure                                    Sections       Source
Descriptions of Routing Topics                    3.1            N
Training Documents                                3.1            N
Candidate Feature List                            3.1            A/M
Final Feature List                                3.1, 4.3.2     A
Relevance Judgements on the Training Documents    3.2.1          N
Topic Relationships                               3.2.1          A/M
State Prior Probabilities                         3.2.2, 4.3.2   M
Feature Conditional Probabilities                 3.2.4          A
Test Documents                                    3.3            N
Document Posterior Probabilities                  3.3            A
Final List of Retrieved Documents                 3.3            A

Table 4.1: Data structures used by the system

The candidate feature list could be defined entirely automatically by sufficiently intelligent procedures such as phrase and proper-name identification. The state prior probabilities are the only data items that must be assessed manually. The data files needed by the system, other than those provided by NIST, occupy about 100 KB on disk.

4.3 Training Phase

4.3.1 Building the Data Structures

We are a Category B participant in TREC-2. As such, we used a subset (just the WSJ articles) of the full training collection. Of these we used only the roughly 29,000 articles for which some relevance judgment is available. In addition, because we considered only ten of the TREC-2 routing topics, we used only the 5,941 documents that have a relevance judgment on at least one of the ten topics.

The inputs to a training run are the candidate feature list, the list of topic relationships, the training documents, and the training relevance judgements. The outputs of a training run are the final list of features and the feature conditional probabilities. A training run takes about 1.5 hours to complete.

4.3.2 Defining the Query

The design of the TREC-2 experiment required that, before we received the test data, we submit the definitions of our system and of our particular query to NIST. The query definition includes the final feature list, the list of topic relationships, and the state prior probabilities. All of the other data is determined automatically, as shown in Table 4.1.

We ran about two dozen sample experiments that applied the system to the training documents to gauge its performance with different query definitions. In these tests, we varied only two of the data inputs: the state prior probabilities and the final feature list. The list of topic relationships remained constant throughout these tests. It took much longer than expected to finish programming the system, so all of the experiments to define the query were completed in the two weeks immediately preceding the final submission of our query definition to NIST.
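The feature conditional probabilities in Table 4.1 are the quantities a training run estimates from the judged WSJ documents. The estimation procedure itself is described in Section 3.2.4 rather than here, so the following C fragment is only a minimal sketch under an assumed estimator: the relative frequency of feature occurrence in relevant versus non-relevant training documents, with add-one smoothing. The type and function names (FeatureStats, train_feature, feature_conditional) are hypothetical and are not taken from the authors' implementation.

#include <stdio.h>

/* Hypothetical counts for one (feature, topic) pair: how often the feature
 * appears in training documents judged relevant vs. non-relevant to the
 * topic. The actual system's data layout is not described at this level of
 * detail in the paper. */
typedef struct {
    long rel_docs;            /* training documents judged relevant        */
    long nonrel_docs;         /* training documents judged non-relevant    */
    long rel_with_feature;    /* relevant documents containing the feature */
    long nonrel_with_feature; /* non-relevant documents containing it      */
} FeatureStats;

/* Update the counts from one judged training document. */
static void train_feature(FeatureStats *s, int doc_is_relevant,
                          int doc_contains_feature)
{
    if (doc_is_relevant) {
        s->rel_docs++;
        if (doc_contains_feature) s->rel_with_feature++;
    } else {
        s->nonrel_docs++;
        if (doc_contains_feature) s->nonrel_with_feature++;
    }
}

/* P(feature present | relevance state), with add-one smoothing so that
 * unseen combinations never yield a probability of exactly 0 or 1. */
static double feature_conditional(const FeatureStats *s, int relevant)
{
    long with  = relevant ? s->rel_with_feature : s->nonrel_with_feature;
    long total = relevant ? s->rel_docs : s->nonrel_docs;
    return (with + 1.0) / (total + 2.0);
}

int main(void)
{
    /* Toy example: four judged documents for one feature and one topic. */
    FeatureStats s = {0, 0, 0, 0};
    train_feature(&s, 1, 1);   /* relevant, feature present    */
    train_feature(&s, 1, 0);   /* relevant, feature absent     */
    train_feature(&s, 0, 0);   /* non-relevant, feature absent */
    train_feature(&s, 0, 0);   /* non-relevant, feature absent */

    printf("P(f | relevant)     = %.3f\n", feature_conditional(&s, 1));
    printf("P(f | not relevant) = %.3f\n", feature_conditional(&s, 0));
    return 0;
}

For the toy data above, the estimates come out to P(f | relevant) = 0.500 and P(f | not relevant) = 0.250; the same counting pass would be repeated for every candidate feature and topic.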
4.4 Test Phase

As a Category B participant in TREC-2, we ran our routing experiment on just the SJMN articles. The inputs to a test run are the final list of features, the feature conditional probabilities, the list of topic relationships, and the test documents. The output of the test run is the final list of retrieved documents. A test run takes about 5.5 hours, which includes decompressing the 150 MB of test data, one file at a time, as it is needed.

5 Results

The TREC-2 designation for our system is "idsra2." Figure 5.1 shows the precision-recall curves for the ten topics we considered, excluding topic 88, for which no TREC-2 participant found any relevant documents. Table 5.1 shows several measures of performance for our results, along with the best, median, and worst values for those measures among all TREC-2 participants. We again exclude topic 88.
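For concreteness, the following C fragment sketches how the document posterior probabilities of Table 4.1 could be obtained during the test phase of Section 4.4 from the manually assessed state prior and the trained feature conditionals. It is not the authors' model: their system uses a Bayesian network with node aggregation, whereas this sketch assumes the features are conditionally independent given the relevance state, and the names TopicQuery and document_posterior are hypothetical.

#include <math.h>
#include <stdio.h>

/* Toy dimension; the real queries used many more features. */
#define NUM_FEATURES 3

/* A hypothetical per-topic query: the manually assessed state prior and the
 * trained conditional probability of each feature given the relevance state. */
typedef struct {
    double prior_relevant;                   /* P(relevant), assessed manually */
    double p_feat_given_rel[NUM_FEATURES];
    double p_feat_given_nonrel[NUM_FEATURES];
} TopicQuery;

/* Posterior P(relevant | observed features) for one test document, computed
 * in log space under the simplifying assumption that features are
 * conditionally independent given the relevance state. This only illustrates
 * how a prior and feature conditionals combine into a document posterior
 * used for ranking; the authors' aggregated network is more structured. */
static double document_posterior(const TopicQuery *q,
                                 const int present[NUM_FEATURES])
{
    double log_rel = log(q->prior_relevant);
    double log_non = log(1.0 - q->prior_relevant);
    int i;

    for (i = 0; i < NUM_FEATURES; i++) {
        double pr = q->p_feat_given_rel[i];
        double pn = q->p_feat_given_nonrel[i];
        log_rel += log(present[i] ? pr : 1.0 - pr);
        log_non += log(present[i] ? pn : 1.0 - pn);
    }
    /* Bayes' rule: P(rel | f) = P(f | rel)P(rel) / sum over both states. */
    return 1.0 / (1.0 + exp(log_non - log_rel));
}

int main(void)
{
    TopicQuery q = {
        0.05,                     /* prior probability of relevance */
        {0.60, 0.40, 0.30},       /* P(feature | relevant)          */
        {0.05, 0.10, 0.20}        /* P(feature | non-relevant)      */
    };
    int doc_a[NUM_FEATURES] = {1, 1, 0};  /* two query features present */
    int doc_b[NUM_FEATURES] = {0, 0, 1};  /* one weak feature present   */

    printf("posterior(doc A) = %.4f\n", document_posterior(&q, doc_a));
    printf("posterior(doc B) = %.4f\n", document_posterior(&q, doc_b));
    return 0;
}

Under this sketch, each test document would be scored against each topic and the documents sorted by posterior to form the final retrieved list; because the code calls log and exp, it must be linked against the math library (for example, with -lm under gcc).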