SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
chapter
J. Driscoll, J. Lautenschlager, M. Zhao
National Institute of Standards and Technology
Donna K. Harman

after: Time(3), None(3), External and Internal Dimensions(2), Motion with Respect to Direction, Order, Location/Space
above: Amount(2), Location/Space, None(2), Linear Dimensions(2), Order, External and Internal Dimensions, Position
as: Condition, Comparison, Manner, None(2), Amount
at: Condition, Location/Space, Manner, Time, Position, External and Internal Dimensions, Duration
atop: Location/Space, Position
before: Location/Space, Time(4), External and Internal Dimensions, Motion with Respect to Direction, Order, None(2), Position
below: Motion with Respect to Direction, None(2), Amount, State, Linear Dimensions, Location/Space, Position
between: Comparison, Duration, External and Internal Dimensions, Location/Space, Position, Amount
by: Amount, Conveyance, Location/Space, Time, Position, External and Internal Dimensions, Means, Order, Motion with Respect to Direction, Duration, None
during: Duration, Time
except: Condition, None(2), Order, Amount(2)
for: Duration, Goal, Purpose, Variation, Destination, Beneficiary, Amount, Range
from: Cause, Source, Time, Motion with Respect to Direction(2), Amount, Location/Space, Position, Instrument
in: Instrument, Location/Space, Purpose, Time, Motion with Respect to Direction(5), External and Internal Dimensions(2), Position, Condition, Goal, Means
into: Condition, Location/Space, Time, Motion with Respect to Direction, External and Internal Dimensions, Position, None
like: Comparison(3), Amount(2), Condition, Manner, None(7)
of: Cause, Location/Space, Source, None, Time, Duration, Position
on: Condition, Conveyance, Location/Space, Source, Time, Linear Dimensions(2), Motion with Respect to Direction, General Dimensions, Position, Means, External and Internal Dimensions, Purpose
over: Duration, Location/Space, Time(2), Order, Linear Dimensions(3), Amount(4), General Dimensions, Motion with Respect to Direction, External and Internal Dimensions, Position, Degree, Condition
per: Means, Order(2)
through: None, Order, Goal, Linear Dimensions, Means, Time, Location/Space, Position, Motion with Respect to Direction
to: Accompaniment, Beneficiary, Condition, Degree, Location/Space, Purpose, Result, Time, General Dimensions, Position, Motion with Respect to Direction(2), Comparison
under: Condition, Location/Space, Position, Order, Degree, None, Amount, Linear Dimensions(3)
until: Condition, Time
upon: Condition, Conveyance, Source, Time, None, General Dimensions, Linear Dimensions, Means, External and Internal Dimensions, Motion with Respect to Direction, Purpose
with: Accompaniment(2), Comparison, Manner, Result, Amount(3), Means, None, Order, Location/Space, Position, Time
within: Range, External and Internal Dimensions(2), Time, Degree, Position, Location/Space
without: External and Internal Dimensions, Order, Amount, None(2)

Figure 2. Prepositions and Their Semantic Categories (Reprinted by Permission of [3]).

For TREC semantic experiments performed after the September deadline (refer to Section 5.2), we treated semantic categories like keywords and used the following normalized similarity coefficient:

$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{n+s} w_{qj}\, d_{ij}}{n+s}$$

where s = 36 is the number of semantic categories. It should be pointed out that we are still performing experiments to determine a proper blend of keyword and semantic information.

4. System Details

The following is an overview of the system we used to perform the TREC experiments. As pointed out earlier, many hours of programming and debugging effort were used to create a system for the TREC experiments.
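As an aside on the normalized similarity coefficient given at the end of Section 3, the following is a minimal sketch. It assumes the coefficient is the inner product of query and document weights over all keyword and semantic-category positions, divided by the total number of positions. The function name `similarity`, the choice of four keyword positions, and the example weights are illustrative assumptions; only the count of 36 semantic categories comes from the text.

```python
NUM_SEMANTIC_CATEGORIES = 36  # s = 36, as stated in the text


def similarity(query_weights, doc_weights):
    """Inner product of the two weight vectors, divided by their length.

    Each vector is assumed to hold keyword weights followed by the
    36 semantic-category weights, so its length is (keywords + s).
    """
    if len(query_weights) != len(doc_weights):
        raise ValueError("query and document vectors must have equal length")
    if not query_weights:
        raise ValueError("vectors must not be empty")
    total = sum(q * d for q, d in zip(query_weights, doc_weights))
    return total / len(query_weights)


# Hypothetical example: 4 keyword positions plus 36 semantic categories.
query = [1.0, 0.0, 2.0, 0.0] + [1.0] + [0.0] * (NUM_SEMANTIC_CATEGORIES - 1)
document = [0.5, 1.0, 1.0, 0.0] + [1.5] + [0.0] * (NUM_SEMANTIC_CATEGORIES - 1)
score = similarity(query, document)
```

Treating semantic categories as extra vector positions means the keyword and semantic contributions are simply added before normalizing; any other blend of the two would need separate weighting, which the sketch does not attempt.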
4.1 The Scanning Process

Under the QA scanning procedure, scanning a document for indexing terms is a three-step process:

A. A token is scanned.
B. The token is analyzed. It is compared to a list of 166 stopwords; stopwords, along with any numbers, are later discarded. Dates are transformed into a generic format, to allow for the matching of dates in a variety of different formats. Non-valid hyphenated words are separated into multiple tokens. Words that can be abbreviated (as determined by a list of 122 abbreviations) are replaced by their abbreviated form. Multi-word phrases that have acronyms are detected and replaced by those acronyms.
C. The remainder of the words are stemmed according to a modified version of the J. B. Lovins stemming algorithm [5]. In some cases, prefixes are also removed.

The scanning procedure for the TREC experiments is a modification of the QA scanning procedure, in which only the words found in the text fields of the documents are tokenized. In order to speed up text processing, the amount of analysis in step B is reduced. Dates are no longer transformed, and abbreviations and acronyms are not created. The stemming algorithm and removal of prefixes in step C remain unchanged.

For the TREC scanning procedure, an additional step is added. During this extra step, a hashing algorithm is used to assign each indexing term a unique 32-bit integer value, which we call a stem code.

4.2 Data Files

After a document is processed by the TREC version of the QA System, four primary data files are created. These are the document weight file, the inverted index file, the inverted data file, and the document name file.
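The reduced TREC scanning procedure of Section 4.1, which produces the terms stored in these files, can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword set is a tiny stand-in for the 166-word list, `stem` is a placeholder suffix stripper rather than the modified Lovins algorithm, every hyphenated word is split (the system splits only non-valid ones), and CRC-32 stands in for the unspecified hashing algorithm. CRC-32 yields a 32-bit value but does not by itself guarantee uniqueness.

```python
import re
import zlib

# Hypothetical stand-in for the system's list of 166 stopwords.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def stem(word):
    """Placeholder stemmer; NOT the modified Lovins algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_code(term):
    """Map an indexing term to a 32-bit unsigned integer (CRC-32 assumed)."""
    return zlib.crc32(term.encode("utf-8")) & 0xFFFFFFFF


def scan(text):
    """Return (stem, stem code) pairs for the indexing terms in text."""
    # Step A: scan tokens; this regex also splits every hyphenated word.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    terms = []
    for token in tokens:
        # Step B (reduced): discard stopwords and numbers.
        if token in STOPWORDS or token.isdigit():
            continue
        # Step C: stem, then assign the stem code in the extra step.
        stemmed = stem(token)
        terms.append((stemmed, stem_code(stemmed)))
    return terms


pairs = scan("The indexing of 42 documents")
```

Here `pairs` holds one entry for "index" and one for "document"; the stopwords and the number are dropped before stemming.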