NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, National Institute of Standards and Technology

Bayesian Inference with Node Aggregation for Information Retrieval
B. Del Favero
R. Fung
4 Experimental Environment
4.1 Hardware and Software Environment
The system was developed by one programmer over a period
of six months. It was written in C using the Symantec
Think C 5.0 compiler under Macintosh System 7.0.1. The
experiments were run on two machines, depending on
availability: a Macintosh IIfx and a Centris 650. Both
machines had 8 MB of memory and a 500 MB hard disk.
4.2 Summary of System Data Structures
Table 4.1 is a list of the data structures used by the system.
Each is listed with the sections of this paper in which it is
described and with an indicator of whether it was provided
by NIST (N), assessed manually by us (M), or generated
automatically by the system (A).
Data Structure                                    Sections       Source
Descriptions of Routing Topics                    3.1            N
Training Documents                                3.1            N
Candidate Feature List                            3.1            A/M
Final Feature List                                3.1, 4.3.2     A
Relevance Judgements on the Training Documents    3.2.1          N
Topic Relationships                               3.2.1          A/M
State Prior Probabilities                         3.2.2, 4.3.2   M
Feature Conditional Probabilities                 3.2.4          A
Test Documents                                    3.3            N
Document Posterior Probabilities                  3.3            A
Final List of Retrieved Documents                 3.3            A

Table 4.1: Data Structures used by the System
The candidate feature list could be defined entirely
automatically by sufficiently intelligent procedures such as
phrase and proper name identification. The state prior
probabilities are the only data items that must be assessed
manually.
The data files needed by the system, other than those
provided by NIST, occupy about 100 KB on disk.
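To make these data structures concrete, the following C
declarations sketch one way they might be represented. The
type and field names are our own illustration (they are not
taken from the system's source code), but the grouping
follows Table 4.1: the query definition collects the
manually and semi-automatically assessed items, and the
scored-document record holds the automatically generated
posteriors.

    /* Hypothetical C declarations mirroring Table 4.1; all
     * names and capacities are illustrative assumptions. */
    #include <stddef.h>

    #define MAX_FEATURES 1024   /* assumed capacity */
    #define NUM_TOPICS   10     /* the ten routing topics we used */

    typedef struct {
        char   *text;                       /* word or phrase */
        double  cond_prob[NUM_TOPICS][2];   /* P(feature | topic state);
                                               0 = not relevant,
                                               1 = relevant */
    } Feature;

    typedef struct {
        int     related[NUM_TOPICS];     /* topic relationships (A/M)   */
        double  prior[NUM_TOPICS];       /* state prior probabilities (M) */
        Feature features[MAX_FEATURES];  /* final feature list (A)      */
        size_t  n_features;
    } QueryDefinition;

    typedef struct {
        char   *doc_id;
        double  posterior[NUM_TOPICS];   /* document posterior
                                            probabilities (A) */
    } ScoredDocument;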
4.3 Training Phase
4.3.1 Building the Data Structures
We are a Category B participant in TREC-2. As such, we
used a subset (just the WSJ articles) of the full training
collection. Of these, we used only the roughly 29,000
articles for which some relevance judgment is available.
In addition, because we considered only ten of the TREC-2
routing topics, we used only the 5941 documents that have
a relevance judgment on at least one of the ten topics.
The inputs to a training run are the candidate feature list,
the list of topic relationships, the training documents, and
the training relevance judgements. The outputs of a
training run are the final list of features and the feature
conditional probabilities.
The time to complete a training run is about 1.5 hours.
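To illustrate what a training run produces, the sketch
below estimates feature conditional probabilities as
smoothed relative frequencies over the judged training
documents. This is an assumption made for illustration;
the actual estimation procedure is the one described in
Section 3.2.4, and all names and counts here are ours.

    /* Sketch: estimate P(feature | topic state) from counts of
     * judged training documents, with Laplace smoothing so that
     * unseen features do not receive zero probability. */
    #include <stdio.h>

    #define N_FEATURES 1024
    #define N_TOPICS   10

    /* counts[f][t][r]: judged documents in state r for topic t
     * (0 = not relevant, 1 = relevant) that contain feature f */
    static long   counts[N_FEATURES][N_TOPICS][2];
    static long   judged[N_TOPICS][2];    /* judged docs per state */
    static double cond_prob[N_FEATURES][N_TOPICS][2];

    static void estimate_conditionals(void)
    {
        for (int f = 0; f < N_FEATURES; f++)
            for (int t = 0; t < N_TOPICS; t++)
                for (int r = 0; r < 2; r++)
                    cond_prob[f][t][r] =
                        (counts[f][t][r] + 1.0) / (judged[t][r] + 2.0);
    }

    int main(void)
    {
        /* In a real run the counts would first be filled by scanning
         * the training documents and relevance judgments. */
        estimate_conditionals();
        printf("P(f0 | topic 0 relevant) = %.3f\n", cond_prob[0][0][1]);
        return 0;
    }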
4.3.2 Defining the Query
The design of the TREC-2 experiment required that we
submit the definitions of our system and of our particular
query to NIST before receiving the test data. The query
definition includes the final feature list, the list of
topic relationships, and the state prior probabilities.
All of the other data is determined automatically, as
shown in Table 4.1.
We ran about two dozen sample experiments that applied
the system to the training documents to gauge its
performance with different query definitions. In these tests,
we varied only two of the data inputs, the state prior
probabilities and the final feature list. The list of topic
relationships remained constant throughout these tests.
It took much longer than expected to finish programming
the system. Thus, all of the experiments to define the query
were completed in the two weeks immediately preceding
the final submission of our query definition to NIST.
4.4 Test Phase
As a Category B participant in TREC-2, we ran our routing
experiment on just the SJMN articles.
The inputs to the test runs were the final list of features, the
feature conditional probabilities, the list of topic
relationships, and the test documents. The output of the test
run is the final list of retrieved documents.
A test run takes about 5.5 hours. This includes
decompressing the 150 MB of test data, one file at a time,
when it is needed.
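The following sketch illustrates the kind of computation a
test run performs for each document: combining a state
prior with the conditional probabilities of the features
the document contains to obtain a posterior probability of
relevance. It assumes a single topic with two states and
independent features, which is a simplification; Section
3.3 describes the actual inference, and the probabilities
below are toy values.

    /* Sketch: posterior probability of relevance for one document
     * under a two-state, independent-feature model (toy numbers). */
    #include <math.h>
    #include <stdio.h>

    #define N_FEATURES 4

    static const double prior = 0.05;             /* P(relevant)  */
    static const double p_feat[N_FEATURES][2] = { /* P(f | state) */
        { 0.01, 0.30 }, { 0.02, 0.25 }, { 0.05, 0.10 }, { 0.20, 0.15 }
    };

    /* present[f] = 1 if the document contains feature f */
    static double posterior(const int present[N_FEATURES])
    {
        double log_rel = log(prior);        /* accumulate in log space */
        double log_irr = log(1.0 - prior);
        for (int f = 0; f < N_FEATURES; f++) {
            if (!present[f])
                continue;
            log_rel += log(p_feat[f][1]);   /* feature given relevant   */
            log_irr += log(p_feat[f][0]);   /* feature given irrelevant */
        }
        double m = log_rel > log_irr ? log_rel : log_irr;
        return exp(log_rel - m) / (exp(log_rel - m) + exp(log_irr - m));
    }

    int main(void)
    {
        int doc[N_FEATURES] = { 1, 1, 0, 1 };   /* toy document */
        printf("P(relevant | doc) = %.4f\n", posterior(doc));
        return 0;
    }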
5 Results
The TREC-2 designation for our system is "idsra2." Figure
5.1 shows the precision-recall curves for the ten topics we
considered, excluding topic 88, for which no TREC-2
participant found any relevant documents.
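For reference, each point on a precision-recall curve
comes from a rank cutoff of the retrieved list: precision
is the fraction of retrieved documents up to that cutoff
that are relevant, and recall is the fraction of all
relevant documents retrieved by that cutoff. The short
program below computes both at every cutoff of a toy
ranked list; the relevance flags are invented, not our
TREC-2 results.

    /* Precision and recall at each rank cutoff of a ranked list. */
    #include <stdio.h>

    int main(void)
    {
        /* relevant[i] = 1 if the i-th retrieved document is relevant */
        const int relevant[]       = { 1, 0, 1, 1, 0, 0, 1, 0, 0, 0 };
        const int n_retrieved      = 10;
        const int n_relevant_total = 5;  /* relevant docs in collection */

        int hits = 0;
        for (int k = 1; k <= n_retrieved; k++) {
            hits += relevant[k - 1];
            printf("k=%2d  precision=%.2f  recall=%.2f\n",
                   k, (double)hits / k, (double)hits / n_relevant_total);
        }
        return 0;
    }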
Table 5.1 shows several measures of performance for our
results, along with the best, median, and worst values for
those measures among all TREC-2 participants. We again
exclude topic 88.