NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology
The QA System
James Driscoll, Jennifer Lautenschlager, Mimi Zhao
Department of Computer Science
University of Central Florida
Orlando, Florida 32816 USA
Abstract
In the QA system, semantic information is combined with
keywords to measure similarity between natural language
queries and documents. A combination of keyword relevance
and semantic relevance is achieved by treating keyword
weights and semantic weights alike, using the vector
processing model as a basis for the combination. The approach
is based on (1) the database concept of semantic modeling
and (2) the linguistic concept of thematic roles. Semantic
information is stored in a lexicon built manually using
information found in Roget's Thesaurus.
Keywords: vector processing model, semantic data model,
semantic lexicon, thematic roles, entity attributes.
1. Introduction
The QA system is based on the semantic approach to text
search reported in [9]. The QA system accepts natural
language queries against collections of documents. The
system uses keywords as document identifiers in order to sort
retrieved documents based on their similarity to a query. The
system also imposes a semantic data model upon the "surface
level" knowledge found in text (unstructured information
from a database point of view).
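The ranking idea above can be sketched as follows. This is a minimal illustration of the vector processing model the paper builds on, not the authors' code: keyword weights and semantic (thematic-role) weights share one vector per document, and documents are sorted by cosine similarity to the query vector. All terms, role labels, and weights below are invented for the example.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term->weight dictionaries."""
    shared = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in shared)
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Keywords and hypothetical thematic-role categories (e.g. LOCATION)
# live in the same vector space, so both kinds of evidence
# contribute to the ranking.
query = {"launch": 1.0, "pad": 1.0, "LOCATION": 0.5}
docs = {
    "d1": {"launch": 0.8, "LOCATION": 0.6, "crawler": 0.4},
    "d2": {"launch": 0.5, "schedule": 0.9},
}
ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
# d1 outranks d2 because it matches on both a keyword and a role weight.
```

Treating semantic weights exactly like keyword weights, as the abstract describes, means no change to the similarity function itself is needed; only the vocabulary of the vectors grows.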
The intent of the QA System has been to provide convenient
access to information contained in the numerous and
large public information documents maintained by Public
Affairs at NASA Kennedy Space Center (KSC). During a
launch at KSC, about a dozen NASA employees access these
printed documents to answer media questions. The planned
document storage for NASA KSC Public Affairs is around
300,000 pages (approximately 900 megabytes of disk storage).
Because of our environment, the performance of our
system is measured by a count of the number of documents
one must read in order to find an answer to a natural language
question. Consequently, the traditional precision and recall
measures for IR have not been used to measure the
performance of the QA System.
We have had success using semantics to improve the
ranking of documents when searching for an "answer" to a
query in a document collection of size less than 50 megabytes.
However, it is important to note that our success has been
demonstrated only in a real-world situation where queries are
the length of a sentence and documents are either a sentence
or, at most, a paragraph [8,9].
Our reasons for participating in TREC have been to (1)
learn how our semantic approach fares when traditional IR
measures of performance are used, and (2) test our system on
larger collections of documents. In this paper, we describe
our system, the experiments we performed, our results, and
failure analysis.
2. Overview of the QA System Modified for TREC
The QA System has been restricted to an IBM compatible
PC platform running under the DOS 5.0 operating system and
without the use of any other licensed commercial software
such as a DOS extender. The DOS version of the QA System
is available at nominal cost from [3]. About 2,000 hours of
programming have been used to develop the current software,
which includes a pleasant user interface; just as many hours
have been used testing the basic keyword operation of the
system. In addition, approximately 1,000 hours have been
used performing experiments involving the semantic aspect
of the QA System. The SQIWI)S relational database system
has been used to carry out some of these semantic
experiments.
The QA System is implemented in C and uses B+ tree
structures for the inverted files. We felt the speed and storage
overhead of the system would not appear reasonable for
TREC, so we designed a separate system, without a pleasant
user interface, which uses a hashing scheme to establish
codes for strings. This was done to cut down on storage space
and to eliminate the use of B+ trees. Approximately 400
hours of programming and debugging effort were used to
modify the system for the TREC experiments. We kept the
DOS environment.
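The hashing scheme mentioned above can be sketched as follows. This is our assumption of the general technique, not the authors' implementation: each term is mapped to a compact integer code by a simple polynomial hash, so the inverted file can store fixed-size codes instead of variable-length strings indexed through B+ trees. The hash function, table size, and collision handling are all illustrative choices.

```python
TABLE_SIZE = 2 ** 20  # illustrative bucket count, not from the paper

def string_code(term, table_size=TABLE_SIZE):
    """Map a term to a fixed-size bucket code via a polynomial hash."""
    h = 0
    for ch in term:
        h = (h * 31 + ord(ch)) % table_size
    return h

# Collisions are resolved by keeping each term in a small per-bucket
# list; a term's identity is then (bucket code, position in bucket).
table = {}

def intern(term):
    code = string_code(term)
    bucket = table.setdefault(code, [])
    if term not in bucket:
        bucket.append(term)
    return (code, bucket.index(term))
```

A fixed-size pair of integers per term is what makes the postings compact; the trade-off is that the bucket lists must be kept on disk alongside the inverted file to recover the original strings.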
This work has been supported in part by NASA KSC Cooperative Agreement NCC 104)03 Project 2, Florida High Technology
and Industry Council Grants 4940-11-28-721 and 4940-11-728, and DARPA Grant 4940-11-28-808.