NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd, J. Driscoll
National Institute of Standards and Technology, D. K. Harman

In Section 2, we describe our original semantic lexicon and an extension which uses a larger number of semantic categories. Section 3 presents an application of an AI connectionist model to the task of routing. Section 4 presents an approach different from the one reported in TREC-1 [4], using our extended semantic lexicon within the vector processing model. Section 5 summarizes our research effort.

2. The Semantic Lexicon

Our semantic approach uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [14]. In TREC-1 [4] and in earlier research [17,19], we selected several classes from this hierarchy to be used as semantic categories. We defined thirty-six semantic categories, as shown in Figure 1.

To explain how semantic categories are assigned to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor":

    vapor
      n. fog 404.2
         fume 401
         illusion 519.1
         spirit 4.3
         steam 328.10
         thing imagined 535.3
      v. be bombastic 601.6
         bluster 911.3
         boast 910.6
         exhale 310.23
         talk nonsense 547.5

Each of the eleven different meanings of the term "vapor" is given in terms of a numerical category. We developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes.
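The lookup just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names VAPOR_SENSES, ROGET_TO_CATEGORY, and category_weights are hypothetical, and the mapping table holds only the four triggers named in the example, keyed by the full index number. The sketch also tallies, for each category, the fraction of the term's senses that trigger it, treating every sense as equally likely.

```python
from collections import Counter
from fractions import Fraction

# Index entries for "vapor" as quoted in the text:
# (part of speech, sense, Roget category number).
VAPOR_SENSES = [
    ("n", "fog", "404.2"),
    ("n", "fume", "401"),
    ("n", "illusion", "519.1"),
    ("n", "spirit", "4.3"),
    ("n", "steam", "328.10"),
    ("n", "thing imagined", "535.3"),
    ("v", "be bombastic", "601.6"),
    ("v", "bluster", "911.3"),
    ("v", "boast", "910.6"),
    ("v", "exhale", "310.23"),
    ("v", "talk nonsense", "547.5"),
]

# Hypothetical excerpt of the Roget-number-to-category mapping.
# Only the triggers mentioned in the text are listed; the full
# mapping used by the authors is not reproduced here.
ROGET_TO_CATEGORY = {
    "404.2": "State",
    "401": "State",
    "328.10": "Temperature",
    "310.23": "Motion with Reference to Direction",
}


def category_weights(senses, mapping):
    """Return {category: fraction of all senses that trigger it}."""
    total = len(senses)
    hits = Counter(mapping[num] for _, _, num in senses if num in mapping)
    return {category: Fraction(count, total) for category, count in hits.items()}


weights = category_weights(VAPOR_SENSES, ROGET_TO_CATEGORY)
for category, weight in weights.items():
    print(f"{category}: {weight}")
```

With the eleven senses of "vapor", the sketch yields 2/11 for State and 1/11 each for Temperature and Motion with Reference to Direction. Exact fractions are used instead of floats so the values can be compared directly with the ones quoted in the text.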
Since there are eleven meanings associated with "vapor," we recorded in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. We are still experimenting with other ways of calculating probabilities. For example, as in [8], a probabilistic part-of-speech tagger could be used to further restrict the different meanings of a term, and existing lexical sources could be used to order the different meanings of a term by frequency of use.

    Thematic Role Categories     Attribute Categories
    TACM  Accompaniment          ACOL  Color
    TAMT  Amount                 AEID  External and Internal Dimensions
    TBNF  Beneficiary            AFRM  Form
    TCSE  Cause                  AGND  Gender
    TCND  Condition              AGDM  General Dimensions
    TCMP  Comparison             ALDM  Linear Dimensions
    TCNV  Conveyance             AMFR  Motion Conjoined with Force
    TDGR  Degree                 AGMT  Motion in General
    TDST  Destination            AMDR  Motion with Reference to Direction
    TDUR  Duration               AORD  Order
    TGOL  Goal                   APHP  Physical Properties
    TINS  Instrument             APOS  Position
    TSPL  Location/Space         ASTE  State
    TMAN  Manner                 ATMP  Temperature
    TMNS  Means                  AUSE  Use
    TPUR  Purpose                AVAR  Variation
    TRNG  Range
    TRES  Result
    TSRC  Source
    TTIM  Time

    Figure 1. Thirty-Six Semantic Categories.

As reported in [4], the use of 36 semantic categories caused problems when dealing with TREC documents. When a document is large, a greater number of the 36 semantic categories are triggered in it. Also, when using the semantic approach described in [19], the probability present for each category in a document is often very close to one. Consequently, almost every one of the