SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
chapter
J. Driscoll
J. Lautenschlager
M. Zhao
National Institute of Standards and Technology
Donna K. Harman
[Figure 1: two-column list headed "Thematic Role Categories" and
"Attribute Categories". Entries still legible in the source include
Amount, Cause, Condition, Conveyance, Time, Color, Form, Internal
Dimensions, Motion with Reference to Direction, Order, Physical
Properties, Position, State, Temperature, Use, and Variation.]
Figure 1. Thirty-Six Semantic Categories.
Prior to TREC, there were 3,000 entries in the lexicon
established by manual examination of roughly 6,000 of the
most frequent words occurring in NASA KSC text. For
TREC, we made 1,000 new entries by examination of 1,700
frequent words occurring in the training text and the 52
training topics. Since the 1911 edition of Roget's Thesaurus
has become public domain, we also created software which
automatically extracted approximately 20,000 lexicon
entries. However, we did not have enough time to explore
the use of these entries.
In order to explain the assignment of semantic categories
to a given term using Roget's Thesaurus, consider the brief
index quotation for the term "vapor":
vapor
  n. fog 404.2
     fume 401
     illusion 519.1
     [illegible] 4.3
     steam 328.10
     thing imagined 535.[illegible]
  v. be bombastic 601.6
     bluster 911.3
     boast 910.6
     exhale 310.23
     talk nonsense 547.5
The eleven different meanings of the term "vapor" are given
in terms of a numerical category. We have developed a
mapping of the numerical categories in Roget's Thesaurus to
the thematic role and attribute categories given in Figure 1.
In this example, "fog" and "fume" correspond to the attribute
State; "steam" maps to the attribute Temperature; and
"exhale" is a trigger for the attribute Motion with Reference to
Direction. The remaining seven meanings associated with
"vapor" do not trigger any thematic roles or attributes. Since
there are eleven meanings associated with "vapor," we
indicate in the lexicon a probability of 1/11 each time a
category is triggered. Hence, a probability of 2/11 is assigned
to State, 1/11 to Temperature, and 1/11 to Motion with
Reference to Direction. This technique of calculating
probabilities is being used as a simple alternative to a corpus
analysis. We are still experimenting with other ways of
calculating probabilities.
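The 1/11 weighting above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text, not the software used in the experiments; the function name category_probabilities and the list representation of a term's senses are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction


def category_probabilities(sense_categories):
    """Give each sense weight 1/n and sum the weights per triggered category.

    sense_categories holds one entry per sense of the term: the category
    that sense triggers, or None if the sense triggers no category.
    (Hypothetical representation, chosen for this sketch.)
    """
    n = len(sense_categories)
    counts = Counter(c for c in sense_categories if c is not None)
    return {category: Fraction(k, n) for category, k in counts.items()}


# The eleven senses of "vapor": "fog" and "fume" trigger State, "steam"
# triggers Temperature, "exhale" triggers Motion with Reference to
# Direction, and the remaining seven senses trigger nothing.
vapor = (
    ["State", "State", "Temperature", "Motion with Reference to Direction"]
    + [None] * 7
)

probs = category_probabilities(vapor)
for category, p in probs.items():
    print(f"{category}: {p}")
```

Exact fractions are used so that the values match the 2/11 and 1/11 figures in the text without rounding.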
Figure 2 shows lexicon entries for prepositions as a sample
of the lexicon used in our experiments. These entries are
somewhat misleading. Most prepositions trigger too many
semantic categories to be of real use. The prepositions
"during" and "until" are examples of useful prepositions.
3[OCRerr] Extended Computation of the Similarity Measure
The probabilistic details of a semantic lexicon and the
computation of semantic weights can be found in [9]. A
detailed explanation of the manner in which we combine
semantic weights and keyword weights can be found in [8].
Essentially, we treat semantic categories like indexing
terms. The probabilities introduced by a semantic lexicon
mean that the frequency of a category in a document becomes
an expected frequency, and the presence of a category in a
document becomes a probability that the category is
present. The document frequency for a category therefore
becomes an expected document frequency, which enables an
inverse document frequency to be calculated for the
category.
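The quantities named in this paragraph can be illustrated with a short Python sketch. The exact formulas are given in [9] and are not reproduced here, so the code makes two assumptions of its own: that the terms in a document trigger a category independently, and that the inverse document frequency has the common log(N / df) form with the expected document frequency in place of df. The function names are hypothetical.

```python
import math


def expected_frequency(trigger_probs):
    """Expected count of a category in one document: the sum of the
    lexicon probabilities of the term occurrences that may trigger it."""
    return sum(trigger_probs)


def presence_probability(trigger_probs):
    """Probability that the category occurs at least once in a document.
    ASSUMPTION: term occurrences trigger the category independently."""
    absent = 1.0
    for p in trigger_probs:
        absent *= 1.0 - p
    return 1.0 - absent


def expected_idf(docs):
    """Inverse document frequency from the expected document frequency.
    ASSUMPTION: the log(N / df) form; the original form is given in [9]."""
    edf = sum(presence_probability(d) for d in docs)
    return math.log(len(docs) / edf) if edf > 0 else 0.0


# Three toy documents for the category State: "vapor" (probability 2/11)
# occurs twice in the first, nothing triggers State in the second, and
# the third contains one term that triggers State with probability 1.
docs = [[2 / 11, 2 / 11], [], [1.0]]

print(expected_frequency(docs[0]))
print(presence_probability(docs[0]))
print(expected_idf(docs))
```

A category that is certain to appear in every document receives an inverse document frequency of zero under this form, just as an ordinary indexing term would.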