SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
chapter
J. Driscoll
J. Lautenschlager
M. Zhao
National Institute of Standards and Technology
Donna K. Harman
after: Time(3), None(3), External and Internal Dimensions(2), Motion with Respect to Direction, Order, Location/Space
above: Amount(2), Location/Space, None(2), Linear Dimensions(2), Order, External and Internal Dimensions, Position
as: Condition, Comparison, Manner, None(2), Amount
at: Condition, Location/Space, Manner, Time, Position, External and Internal Dimensions, Duration
atop: Location/Space, Position
before: Location/Space, Time(4), External and Internal Dimensions, Motion with Respect to Direction, Order, None(2), Position
below: Motion with Respect to Direction, None(2), Amount, State, Linear Dimensions, Location/Space, Position
between: Comparison, Duration, External and Internal Dimensions, Location/Space, Position, Amount
by: Amount, Conveyance, Location/Space, Time, Position, External and Internal Dimensions, Means, Order, Motion with Respect
to Direction, Duration, None
during: Duration, Time
except: Condition, None(2), Order, Amount(2)
for: Duration, Goal, Purpose, Variation, Destination, Beneficiary, Amount, Range
from: Cause, Source, Time, Motion with Respect to Direction(2), Amount, Location/Space, Position, Instrument
in: Instrument, Location/Space, Purpose, Time, Motion with Respect to Direction(5), External and Internal Dimensions(2), Position,
Condition, Goal, Means
into: Condition, Location/Space, Time, Motion with Respect to Direction, External and Internal Dimensions, Position, None
like: Comparison(3), Amount(2), Condition, Manner, None(7)
of: Cause, Location/Space, Source, None, Time, Duration, Position
on: Condition, Conveyance, Location/Space, Source, Time, Linear Dimensions(2), Motion with Respect to Direction, General
Dimensions, Position, Means, External and Internal Dimensions, Purpose
over: Duration, Location/Space, Time(2), Order, Linear Dimensions(3), Amount(4), General Dimensions, Motion with Respect to
Direction, External and Internal Dimensions, Position, Degree, Condition
per: Means, Order(2)
through: None, Order, Goal, Linear Dimensions, Means, Time, Location/Space, Position, Motion with Respect to Direction
to: Accompaniment, Beneficiary, Condition, Degree, Location/Space, Purpose, Result, Time, General Dimensions, Position,
Motion with Respect to Direction(2), Comparison
under: Condition, Location/Space, Position, Order, Degree, None, Amount, Linear Dimensions(3)
until: Condition, Time
upon: Condition, Conveyance, Source, Time, None, General Dimensions, Linear Dimensions, Means, External and Internal
Dimensions, Motion with Respect to Direction, Purpose
with: Accompaniment(2), Comparison, Manner, Result, Amount(3), Means, None, Order, Location/Space, Position, Time
within: Range, External and Internal Dimensions(2), Time, Degree, Position, Location/Space
without: External and Internal Dimensions, Order, Amount, None(2)
Figure 2. Prepositions and Their Semantic Categories
(Reprinted by Permission of [3]).
For TREC semantic experiments performed after the
September deadline (refer to Section 5.2), we treated
semantic categories like keywords and used the following
normalized similarity coefficient:
$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{t+s} w_{qj}\, d_{ij}}{\text{[denominator illegible in source]}}$$
where s = 36 is the number of semantic categories. It should
be pointed out that we are still performing experiments to
determine a proper blend of keyword and semantic information.
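As a rough illustration of treating semantic categories like keywords, the sketch below places keyword weights and category weights in a single vector and scores a document by a normalized dot product. The normalization by document vector length is an assumption made for this sketch (the source does not give the normalization term legibly), and the `kw:`/`cat:` prefixes, the weights, and the function name `similarity` are all hypothetical.

```python
import math

def similarity(query_weights, doc_weights):
    """Dot product of query and document weights over keywords and
    semantic categories alike, divided by the document vector length.
    The choice of normalization is an assumption of this sketch."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    norm = math.sqrt(sum(d * d for d in doc_weights.values()))
    return dot / norm if norm else 0.0

# Hypothetical weights: one keyword dimension and one category dimension.
query = {"kw:retrieval": 1.0, "cat:Time": 0.5}
doc = {"kw:retrieval": 3.0, "cat:Time": 4.0}
score = similarity(query, doc)
```

Because categories simply add dimensions to the vectors, the relative scale of keyword and category weights determines how the two kinds of information blend.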
4. System Details
The following is an overview of the system we used to
perform the TREC experiments. As pointed out earlier, many
hours of programming and debugging went into creating
this system.
4.1 The Scanning Process
Under the QA scanning procedure, scanning a document
for indexing terms is a three-step process:
A. A token is scanned.
B. The token is analyzed. It is compared to a list of 166
stopwords, and these words (along with any numbers) are
later discarded. Dates are transformed into a generic format,
to allow for the matching of dates in a variety of different
formats. Non-valid hyphenated words are separated into
multiple tokens. Words that can be abbreviated (as determined
by a list of 122 abbreviations) are replaced by their
abbreviated form. Multiple-word phrases that have acronyms
are detected and replaced by their respective acronyms.
C. The remainder of the words are stemmed according
to a modified version of the J. B. Lovins stemming
algorithm [5]. In some cases, prefixes are also removed.
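The three steps above can be sketched roughly as follows. This is a simplified illustration, not the QA System's code: the stopword and abbreviation sets are tiny hypothetical stand-ins for the 166-word and 122-entry lists, `toy_stem` is a placeholder suffix stripper rather than the modified Lovins stemmer, every hyphenated token is split because no list of valid hyphenated words is available here, and date and acronym handling are omitted.

```python
import re

# Hypothetical stand-ins for the 166 stopwords and 122 abbreviations.
STOPWORDS = {"the", "a", "of", "is", "to", "and"}
ABBREVIATIONS = {"department": "dept", "corporation": "corp"}

def toy_stem(word):
    """Placeholder suffix stripper; NOT the modified Lovins stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def scan(text):
    """Return the indexing terms found in text."""
    terms = []
    # Step A: scan tokens (alphanumeric runs, optionally joined by hyphens).
    for token in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower()):
        # Step B: analyze. Hyphenated tokens are always split in this sketch.
        for part in token.split("-"):
            if part in STOPWORDS or part.isdigit():
                continue  # stopwords and numbers are discarded
            if part in ABBREVIATIONS:
                # Assumption: abbreviated forms are kept as-is, not stemmed.
                terms.append(ABBREVIATIONS[part])
                continue
            # Step C: stem the remaining words.
            terms.append(toy_stem(part))
    return terms

terms = scan("The department is indexing 42 well-known documents")
```

Discarding stopwords and numbers before stemming keeps the stemmer from doing work on tokens that never become indexing terms.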
The scanning procedure for the TREC experiments is a
modification of the QA scanning procedure, in which only
the words found in the text fields of the documents are
tokenized. In order to speed up text processing, the amount
of analyzing in step B is reduced. Dates are no longer
transformed, and abbreviations and acronyms are not created.
The stemming algorithm and removal of prefixes in step C
remain unchanged.
For the TREC scanning procedure, an additional step is
added. During this extra step, a hashing algorithm is used to
assign each indexing term a unique 32-bit integer value,
which we call a stem code.
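One way such an assignment could work is sketched below. The source does not name the hashing algorithm or say how uniqueness is ensured, so both choices here are assumptions: CRC-32 (via Python's `zlib.crc32`) supplies the 32-bit value, and a collision is resolved by stepping to the next free code. The class name `StemCodeTable` is hypothetical.

```python
import zlib

class StemCodeTable:
    """Assigns each indexing term a distinct 32-bit integer code."""

    def __init__(self):
        self.code_of = {}  # term -> stem code
        self.term_of = {}  # stem code -> term

    def assign(self, term):
        # A term that was seen before keeps the code it already has.
        if term in self.code_of:
            return self.code_of[term]
        # Assumption: CRC-32 stands in for the unspecified hash function.
        code = zlib.crc32(term.encode("utf-8")) & 0xFFFFFFFF
        # Assumption: on collision, probe upward until a free code is found.
        while code in self.term_of:
            code = (code + 1) & 0xFFFFFFFF
        self.code_of[term] = code
        self.term_of[code] = term
        return code

table = StemCodeTable()
index_code = table.assign("index")
document_code = table.assign("document")
```

A hash alone cannot guarantee distinct codes for distinct terms, which is why the sketch keeps a table of codes already in use.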
4.2 Data Files
After a document is processed by the TREC version of the
QA System, four primary data files are created. These are
the document weight file, the inverted index file, the inverted
data file, and the document name file.