NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
J. Driscoll
J. Lautenschlager
M. Zhao
National Institute of Standards and Technology
Donna K. Harman
The document weight file is a binary file containing a list
of floating point numbers. These numbers are ordered
sequentially by document number, and represent the summation
of the squared query weights for a particular
document's keywords.
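As a sketch of how such a file might be read, the snippet below treats the document weight file as a flat sequence of 4-byte floats indexed by document number. The float size and byte order are assumptions; the paper does not specify them.

```python
import struct

def read_document_weights(path):
    """Read a binary file of 4-byte little-endian floats, one per
    document, ordered by document number (illustrative layout; the
    float size and byte order are assumed, not taken from the paper)."""
    weights = []
    with open(path, "rb") as f:
        while chunk := f.read(4):
            weights.append(struct.unpack("<f", chunk)[0])
    return weights
```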
The inverted index file is ordered by stem code. Each code
has two file pointers, one pointing to the first block of data
in the data file, and the second pointing to the last block of
data. The inverted data file then consists of blocks of data,
containing pairs of document numbers that the code is found
in, and the code's frequency within that document. The
blocks are linked together to form a list.
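An in-memory sketch of this layout follows, assuming a fixed block capacity (the actual on-disk block size is not given in the paper). Each stem code keeps a pointer to its first and last blocks, and each block holds (document number, frequency) pairs and a link to the next block.

```python
class Block:
    """A fixed-capacity block of (document number, frequency) pairs,
    linked to the next block; stands in for one block of the on-disk
    inverted data file."""
    def __init__(self, capacity=4):
        self.pairs = []          # (doc_id, freq) pairs in this block
        self.capacity = capacity
        self.next = None         # link to the next block in the list

class InvertedIndex:
    """Maps each stem code to pointers to its first and last blocks,
    mirroring the two file pointers described above."""
    def __init__(self):
        self.heads = {}   # stem code -> first block
        self.tails = {}   # stem code -> last block

    def post(self, code, doc_id, freq):
        """Append one posting, starting a new block when the tail is full."""
        tail = self.tails.get(code)
        if tail is None or len(tail.pairs) == tail.capacity:
            block = Block()
            if tail is None:
                self.heads[code] = block   # first block for this code
            else:
                tail.next = block          # link new block onto the list
            self.tails[code] = tail = block
        tail.pairs.append((doc_id, freq))

    def postings(self, code):
        """Walk the linked blocks, yielding postings in insertion order."""
        block = self.heads.get(code)
        while block is not None:
            yield from block.pairs
            block = block.next
```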
The document name file consists of a list of pointers into
the original text file.
For Vol.1, both the document weight file and the inverted
index file were two megabytes. The inverted data file was
approximately 385 megabytes, and the document name file
was two megabytes.
4. Basic Procedure
For the TREC experiments, we did the following:
Step 1: This step involves matching every legitimate stem
in the document collection with a unique integer value. This
is done with a linear hashing function. A table containing
this mapping, along with the number of documents each code
is found in, is temporarily saved for use in Step 2.
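Step 1 can be sketched as below. The paper's linear hashing function is not specified, so a plain dictionary stands in for the hash table; the point is the mapping of each distinct stem to a unique integer code together with its document frequency.

```python
def build_stem_table(documents):
    """Assign each distinct stem a unique integer code and count the
    number of documents each stem occurs in (sketch of Step 1; a dict
    replaces the paper's unspecified linear hashing function)."""
    code_of = {}     # stem -> unique integer code
    doc_freq = {}    # stem -> number of documents containing it
    for doc in documents:
        for stem in set(doc):          # count each stem once per document
            if stem not in code_of:
                code_of[stem] = len(code_of)
            doc_freq[stem] = doc_freq.get(stem, 0) + 1
    return code_of, doc_freq
```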
Step 2: This step creates the four data files described
above. The entire database is scanned, with the four files
being created on the fly. Once this is accomplished, the table
from Step 1 is no longer necessary, and is discarded.
Step 3: Relevant documents for each query are selected
using the Jaccard similarity coefficient. The top 200 docu-
ments for each query are then determined.
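Step 3 can be sketched as follows: the Jaccard similarity coefficient |A ∩ B| / |A ∪ B| is computed between the query's term set and each document's term set, and the documents are ranked to keep the top 200. The term-set representation is an assumption made for illustration.

```python
def jaccard(query_terms, doc_terms):
    """Jaccard similarity coefficient: |A intersect B| / |A union B|."""
    a, b = set(query_terms), set(doc_terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k(query_terms, docs, k=200):
    """Rank documents by Jaccard similarity against the query and
    return the ids of the top k (k = 200 in the paper)."""
    scored = sorted(docs.items(),
                    key=lambda item: jaccard(query_terms, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```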
The above three steps were followed to create the results
for the September deadline. In the next section we present
our "official" experiments and results, and some "unofficial"
experiments and results.
5. Experiments and Results
Our experiments were intended to be Category A
experiments with two results submitted for each ranking task.
One ranking result would be for just keywords, the other
ranking result would be for keywords combined with
semantics. All query construction was automatic, and the
treatment of ad-hoc and routing queries was identical.
We also performed experiments concerning the expected
behavior of string hashing functions and the use of part-of-
speech tagging to improve retrieval performance. These
experiments are not reported here.
As an example of our automatically built ad-hoc and
routing queries, consider Topic 004 reproduced in Figure 3.
Figure 4 indicates the keyword and semantic information
generated by the QA System for this topic. The first part of
Figure 4 indicates the stems along with their frequencies
found in the query. The second part of Figure 4 indicates the
semantic categories also found in the query along with their
expected frequencies and probability present.
It is important to note that the topic represented in Figure
3 and Figure 4 has generated many semantic categories and
the probability present for most of them is close to, or at,
100%. This is mainly due to the length of the text involved.
We discovered that, for the TREC document collection, each
document generated many semantic categories with high
probability present. Because we treated semantic categories
like keywords, this caused semantic weights to be essentially
useless.
For the TREC September deadline, we were only able to
submit a routing experiment using a keywording approach.
The results of this experiment, computed with the aid of Chris
Buckley's SMART evaluation program [1], are shown in
Figure 5. The results are not good. In Section 6, we discuss
what impeded our experiments.
Further "unofficial" experiments were designed to test the
use of semantics. The main goal of our experiments was to
demonstrate that our original routing results could be
improved through the use of semantic analysis. In order to
do this, we made two modifications to our approach. The
first change involved dividing the original TREC documents
into paragraphs. The second change involved a semantic
analysis when calculating the list of relevant documents.
Our experiments involved the use of only six routing
queries (for topics 001, 002, 003, 007, 017, and 022). These
topics were selected because our original results for them
were poor. Through the use of semantic analysis, we hoped
to significantly improve our results. Figure 6 shows the
precision-recall statistics for the six "poor" queries using the
retrieval results which created the statistics in Figure 5.
When analyzing our results, we computed all precision-
recall tables through the use of Chris Buckley's SMART
evaluation program [1]. The relevancy lists used were those
produced before November 1 (the original qrels for the
routing queries). We did not use the modified results that
were distributed later for Query 017. This should not affect
our results, though, because our experiments were aimed only
at improving our precision-recall averages, and the relevancy
results used were consistent from one experiment to the other.
5.1 Re-ranking of Documents
In an effort to demonstrate that semantics could affect
retrieval in the TREC environment, we used the original QA
System with a semantic lexicon containing TREC words as
described in Section 3.2. We created a separate database for
each query we considered, for a total of six databases. Each
database contained only the documents that we originally
judged in the top 250 for each query. Because of this, when
we computed new relevancy lists we were simply rearranging
the order of the same 250 documents, not bringing in new
documents.
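The re-ranking step can be sketched as below: the originally retrieved top-250 list for a query becomes the document collection, and the query is imposed again with some similarity function. The function name and the dictionary-of-term-lists representation are assumptions for illustration.

```python
def rerank(query_terms, original_ranking, doc_terms, similarity, limit=250):
    """Re-rank a query's originally retrieved documents by applying the
    query again to just that subset (the top-250 list in the paper);
    no new documents can enter the ranking, only the order changes."""
    subset = original_ranking[:limit]
    return sorted(subset,
                  key=lambda doc_id: similarity(query_terms, doc_terms[doc_id]),
                  reverse=True)
```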
Figure 7 reveals the precision-recall statistics when orig-
inally retrieved documents for a query are used as a document
collection and re-ranked by imposing the query again. There
is a 25.8% increase when comparing the 11-pt average here
to the 11-pt average of the originally retrieved text (Figure
6).
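The 11-pt average compared here is the standard 11-point interpolated average precision reported by the SMART evaluation program. A sketch of that computation, assuming a binary relevance judgment per ranked document:

```python
def eleven_point_average(ranked_relevance, total_relevant):
    """11-point interpolated average precision (sketch): interpolated
    precision is taken at recall levels 0.0, 0.1, ..., 1.0 and averaged.
    ranked_relevance is a list of 0/1 judgments in rank order."""
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)        # precision at this rank
            recalls.append(hits / total_relevant)  # recall at this rank
    points = []
    for level in (i / 10 for i in range(11)):
        # interpolated precision: best precision at recall >= level
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11
```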
To determine the ranking of a particular document with
paragraph divisions, we defined the similarity coefficient of
a document to be equal to the highest coefficient associated
with one of its corresponding paragraphs. The paragraph
divisions were automatically constructed from the original
text. The precision-recall statistics for paragraphs being used
as documents are shown in Figure 8. There is an 18.8%