NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
J. Driscoll, J. Lautenschlager, M. Zhao
National Institute of Standards and Technology
Donna K. Harman

[Figure 1. Thirty-Six Semantic Categories. A two-column listing of Thematic Role Categories and Attribute Categories. Legible entries include Amount, Cause, Color, Condition, Conveyance, Duration, Form, Instrument, Order, Physical Properties, Position, State, Temperature, Time, Use, Variation, and Motion with Reference to Direction; the remaining entries are illegible.]

Prior to TREC, the lexicon contained 3,000 entries, established by manual examination of roughly 6,000 of the most frequent words occurring in NASA KSC text. For TREC, we added 1,000 new entries by examining 1,700 frequent words occurring in the training text and the 52 training topics. Because the 1911 edition of Roget's Thesaurus is in the public domain, we also created software that automatically extracted approximately 20,000 lexicon entries. However, we did not have enough time to explore the use of these entries.

To explain how semantic categories are assigned to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor":

    vapor
      n.  fog 404.2
          fume 401
          illusion 519.1
          [illegible] 4.3
          steam 328.10
          thing imagined 535[illegible]
      v.  be bombastic 601.6
          bluster 911.3
          boast 910.6
          exhale 310.23
          talk nonsense 547.5

Each of the eleven meanings of the term "vapor" is given in terms of a numerical category. We have developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings associated with "vapor," we record in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. We are still experimenting with other ways of calculating probabilities.

Figure 2 shows lexicon entries for prepositions as a sample of the lexicon used in our experiments. These entries are somewhat misleading: most prepositions trigger too many semantic categories to be of real use. The prepositions "during" and "until" are examples of useful prepositions.

3. Extended Computation of the Similarity Measure

The probabilistic details of a semantic lexicon and the computation of semantic weights can be found in [9]. A detailed explanation of the manner in which we combine semantic weights and keyword weights can be found in [8]. Essentially, we treat semantic categories like indexing terms. Because a semantic lexicon introduces probabilities, the frequency of a category in a document becomes an expected frequency, and the presence of a category in a document becomes a probability that the category is present.
This means that the document frequency for a category becomes an expected document frequency, and this enables an inverse document frequency to be calculated for a category.
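
As an illustration only, the following Python sketch works through the "vapor" example and the expected-value quantities described above. It is not the implementation used in our experiments, whose details are in [8] and [9]. The function names, the toy documents, the independence assumption used for the presence probability, and the log(N / expected document frequency) form of the inverse document frequency are all assumptions made for this sketch.

```python
import math
from fractions import Fraction


def category_probabilities(sense_categories):
    """Give each triggered category 1/n per triggering sense, n = number of senses."""
    n = len(sense_categories)
    probs = {}
    for category in sense_categories:
        if category is not None:  # None marks a sense that triggers nothing
            probs[category] = probs.get(category, Fraction(0)) + Fraction(1, n)
    return probs


# The eleven senses of "vapor" in index order; only four trigger a category.
VAPOR_SENSE_CATEGORIES = [
    "State",        # fog
    "State",        # fume
    None,           # illusion
    None,           # sense with category 4.3
    "Temperature",  # steam
    None,           # thing imagined
    None,           # be bombastic
    None,           # bluster
    None,           # boast
    "Motion with Reference to Direction",  # exhale
    None,           # talk nonsense
]

LEXICON = {"vapor": category_probabilities(VAPOR_SENSE_CATEGORIES)}


def trigger_probability(term, category, lexicon):
    """Probability that one occurrence of `term` triggers `category` (0 if unknown)."""
    return float(lexicon.get(term, {}).get(category, 0))


def expected_frequency(doc_terms, category, lexicon):
    """Expected number of times `category` is triggered in a document."""
    return sum(trigger_probability(t, category, lexicon) for t in doc_terms)


def presence_probability(doc_terms, category, lexicon):
    """P(category triggered at least once), ASSUMING independent occurrences."""
    p_absent = 1.0
    for t in doc_terms:
        p_absent *= 1.0 - trigger_probability(t, category, lexicon)
    return 1.0 - p_absent


def expected_idf(docs, category, lexicon):
    """Assumed idf form: log(N / expected document frequency)."""
    expected_df = sum(presence_probability(d, category, lexicon) for d in docs)
    if expected_df == 0.0:
        return 0.0  # category never triggered in the collection
    return math.log(len(docs) / expected_df)


# Hypothetical three-document collection, already tokenized.
DOCS = [["vapor", "rises", "vapor"], ["vapor"], ["engine", "test"]]

if __name__ == "__main__":
    print(LEXICON["vapor"])
    print(expected_frequency(DOCS[0], "State", LEXICON))
    print(presence_probability(DOCS[0], "State", LEXICON))
    print(expected_idf(DOCS, "State", LEXICON))
```

In this toy collection, the first document contains "vapor" twice, so its expected frequency for State is 2/11 + 2/11 = 4/11, while its presence probability under the independence assumption is 1 - (9/11)^2 = 40/121. Summing the presence probabilities over all documents gives the expected document frequency from which the inverse document frequency is computed.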