NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
J. Driscoll
J. Lautenschlager
M. Zhao
National Institute of Standards and Technology
Donna K. Harman
For the TREC experiments, we used nine IBM PS/2 Model
95 computers. These were 50 MHz 486 computers, each with
8 megabytes of RAM and one 400 megabyte hard drive. Two
of the machines had 16 megabytes of RAM and another one
had two 400 megabyte hard drives. A 33 MHz 486 PC was
used to distribute text to the nine IBM machines, and then
collect information for merging and redistribution.
Our plan was to put each of the nine Vol.1 and Vol.2
document collections on a separate machine, determine the
document frequency for each term tj on each machine, and
then merge the results and determine the inverse document
frequency for each term in the entire collection. Next, we
would index and query each document collection separately,
and finally merge the retrieval results.
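The document-frequency merge step above can be sketched as follows; the toy two-machine collections are illustrative stand-ins, not the actual TREC data:

```python
from collections import Counter
from math import log10

# Hypothetical per-machine collections: each document is a set of terms.
machine_collections = [
    [{"payload", "crew", "training"}, {"launch", "crew"}],   # machine 1
    [{"training", "schedule"}, {"payload", "launch"}],       # machine 2
]

# Step 1: each machine computes document frequencies for its own collection.
local_dfs = []
for docs in machine_collections:
    df = Counter()
    for terms in docs:
        df.update(terms)            # each document contributes at most 1 per term
    local_dfs.append(df)

# Step 2: merge the local counts and compute idf over the entire collection.
global_df = Counter()
for df in local_dfs:
    global_df.update(df)
N = sum(len(docs) for docs in machine_collections)
idf = {term: log10(N / count) for term, count in global_df.items()}
```

Each machine only needs to ship its local term counts, so the merge traffic is small compared to redistributing the documents themselves.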
The vector processing model is the basis for our approach.
Terms used as document identifiers are keywords modified
by various techniques such as stop lists, stemming, synonyms,
and query reformulation.
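A minimal sketch of this keyword normalization, assuming an illustrative stop list and a crude suffix-stripping stemmer in place of the actual techniques used in the experiments:

```python
# Illustrative stop list; the real system's list differs.
STOP_WORDS = {"how", "does", "the", "through", "before", "a"}

def stem(word):
    # Naive suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(text):
    # Lowercase, drop stop words, and stem what remains.
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```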
The calculation of the weighting factor (w) for a term in a
document is a combination of term frequency (tf), document
frequency (df), and inverse document frequency (idf). The
basic definitions are as follows:

tf_ij = number of occurrences of term t_j in document D_i
df_j = number of documents in a collection which contain t_j
idf_j = log(N/df_j), where N = total number of documents
w_ij = tf_ij * idf_j
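The weighting formula can be sketched directly from these definitions; the three-document collection below is illustrative only:

```python
from math import log10

# Toy collection: each document is a list of term occurrences.
docs = [
    ["payload", "crew", "crew", "training"],
    ["launch", "schedule"],
    ["crew", "launch", "training"],
]

N = len(docs)
df = {}
for doc in docs:
    for term in set(doc):           # count each document once per term
        df[term] = df.get(term, 0) + 1

def weight(term, doc):
    tf = doc.count(term)            # tf_ij
    idf = log10(N / df[term])       # idf_j = log(N / df_j)
    return tf * idf                 # w_ij = tf_ij * idf_j
```

Note that a term appearing in every document gets idf = log(N/N) = 0 and therefore contributes no weight, which is the intended effect of the idf factor.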
When the system is used to query a collection of documents
with t terms, the system computes a vector Q equal to
(w_q1, w_q2, ..., w_qt) representing the weights for each term in the
query. The retrieval of a document with vector D_i equal to
(w_i1, w_i2, ..., w_it) representing the weights of each term in the
document is based on the value of a similarity measure
between the query vector and the document vector. For
the NIST experiments, we used the Jaccard similarity
coefficient [7] to retrieve documents for the September
deadline.
Jaccard Coefficient

sim(Q, D_i) = [ Σ_{j=1}^{t} w_qj * w_ij ] / [ Σ_{j=1}^{t} (w_qj)^2 + Σ_{j=1}^{t} (w_ij)^2 - Σ_{j=1}^{t} w_qj * w_ij ]
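The coefficient can be sketched over weighted term vectors as follows; the query and document vectors shown are illustrative:

```python
def jaccard_sim(q, d):
    # sim(Q, D_i) per the formula above: dot product over the sum of
    # squared weights minus the dot product.
    dot = sum(wq * wd for wq, wd in zip(q, d))
    q_sq = sum(wq * wq for wq in q)
    d_sq = sum(wd * wd for wd in d)
    return dot / (q_sq + d_sq - dot)

query = [1.0, 0.5, 0.0]
doc = [1.0, 0.0, 2.0]
```

Unlike the cosine measure, the denominator here grows with the total weight mass of both vectors, so documents sharing only a small fraction of their weighted terms with the query score low even if the shared terms are heavily weighted.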
3. Semantic Approach
Although the basic IR approach has shown some success
in regard to natural language queries, it ignores some valuable
information. For instance, consider the following query:
How long does the payload crew go through
training before a launch?
The typical IR system dismisses the following words in the
query as useless or empty: "how", "does", "the", "through",
"before", and "a". Some of these words contain valuable
semantic information. The following list indicates some of
the semantic information triggered by a few of these words:
how long    Duration, Time
through     Location/Space, Motion with Reference to Direction, Time
before      Location/Space, Time
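The trigger lookup above can be sketched as a simple table scan over the query; the table mirrors the three examples in the text, and the matching here is naive substring matching for illustration:

```python
# Trigger words/phrases mapped to the semantic categories they signal.
TRIGGERS = {
    "how long": {"Duration", "Time"},
    "through": {"Location/Space", "Motion with Reference to Direction", "Time"},
    "before": {"Location/Space", "Time"},
}

def triggered_categories(query):
    # Collect every category signaled by a trigger present in the query.
    q = query.lower()
    cats = set()
    for trigger, categories in TRIGGERS.items():
        if trigger in q:
            cats |= categories
    return cats
```

Applied to the payload-crew query above, this recovers semantic information from exactly the words a conventional stop list would discard.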
The database concept of semantic modeling and the linguistic
concept of thematic roles can help in this regard.
3.1 Semantic Modeling
Semantic modeling was an object of considerable database
research in the late 1970's and early 1980's. A brief overview
can be found in [2]. Essentially, the semantic modeling
approach identified concepts useful in talking informally
about the real world. These concepts included the two notions
of entities (objects in the real world) and relationships among
entities (actions in the real world). Both entities and rela-
tionships have properties.
The properties of entities are often called attributes. There
are basic or surface level attributes for entities in the real
world. Examples of surface level entity attributes are General
Dimensions, Color, and Position. These properties are
prevalent in natural language. For example, consider the
phrase "large, black book on the table" which indicates the
General Dimensions, Color, and Position of the book.
In linguistic research, the basic properties of relationships
are discussed and called thematic roles. Thematic roles are
also referred to in the literature as participant roles, semantic
roles and case roles. Examples of thematic roles are Bene-
ficiary and Time. Thematic roles are prevalent in natural
language; they reveal how sentence phrases and clauses are
semantically related to the verbs in a sentence. For example,
consider the phrase "purchase for Mary on Wednesday"
which indicates who benefited from a purchase (Beneficiary)
and when a purchase occurred (Time).
The goal of our approach is to detect thematic information
along with attribute information contained in natural lan-
guage queries and documents. When the information is
present, our system uses it to help find the most relevant
document. In order to use this additional information, the
basic underlying concept of text relevance as presented earlier
needs to be modified. The major modifications include the
addition of a lexicon with thematic and attribute information,
and a modified computation of the similarity coefficient. We
now discuss our semantic lexicon.
3.2 The Semantic Lexicon
Our system uses a thesaurus as a source of semantic
categories (thematic and attribute information). For example,
Roget's Thesaurus contains a hierarchy of word classes to
relate word senses [6]. For our research, we have selected
several classes from this hierarchy to be used for semantic
categories. We have defined thirty-six semantic categories
as shown in Figure 1.