36 semantic categories becomes present in every document. This causes semantic category weights to become very low and of little use within that approach. As reported in [4], one way to solve this problem is to break the TREC documents into paragraphs. But another way to solve the problem of long documents causing semantic weights to be of little value is to have more semantic categories. A large number of "semantic" categories can be obtained (for example) by using the categories and/or subcategories found in Roget's Thesaurus, instead of the 36 semantic categories we have used. This may be a deviation from database semantic modeling. In any case, it needs to be examined.

Consequently, for the experiments reported here, a semantic lexicon was created based on all the word senses found in the public domain 1911 version of Roget's Thesaurus. To provide an example, consider Topic 052 as shown in Figure 2. Figure 3 indicates the keywords and frequency information within Topic 052, along with the semantic categories obtained from our extended lexicon for those keywords. Note that stemming was not used for the processing of Topic 052; so, some keywords in Topic 052 were not located in our lexicon (e.g., sanctions). The categories recorded in our extended semantic lexicon use the category numbers found in the 1911 version of Roget's Thesaurus. These numbers are then followed by a part-of-speech code also found in the 1911 version of Roget's Thesaurus. The number after the part-of-speech code represents a sub-category, but this number does not appear in the 1911 version of Roget's Thesaurus; it was created based on groupings of words within the thesaurus.

<top>
<head> Tipster Topic Description
<num> Number: 052
<dom> Domain: International Economics
<title> Topic: South African Sanctions
<desc> Description:
Document discusses sanctions against South Africa.
<narr> Narrative:
A relevant document will discuss any aspect of South African sanctions, such as: sanctions declared/proposed by a country against the South African government in response to its apartheid policy, or in response to pressure by an individual, organization or another country; international sanctions against Pretoria imposed by the United Nations; the effects of sanctions against S. Africa; opposition to sanctions; or compliance with sanctions by a company. The document will identify the sanctions instituted or being considered, e.g., corporate disinvestment, trade ban, academic boycott, arms embargo.
<con> Concept(s):
1. sanctions, international sanctions, economic sanctions
2. corporate exodus, corporate disinvestment, stock divestiture, ban on new investment, trade ban, import ban on South African diamonds, U.N. arms embargo, curtailment of defense contracts, cutoff of nonmilitary goods, academic boycott, reduction of cultural ties
3. apartheid, white domination, racism
4. anti-apartheid, black majority rule
5. Pretoria
<fac> Factor(s):
<nat> Nationality: South Africa
</fac>
<def> Definition(s):

Figure 2. Topic 052.
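As a rough illustration of the kind of lookup just described (a sketch, not the authors' actual implementation), the fragment below maps unstemmed topic keywords to Roget-style codes of the form (category number, part-of-speech code, sub-category number) and tallies keyword frequencies, in the spirit of Figure 3. The lexicon contents and category codes shown are invented placeholders.

```python
from collections import Counter

# Hypothetical fragment of an extended semantic lexicon keyed on unstemmed words.
# Each entry is a list of (category number, part-of-speech code, sub-category)
# tuples loosely following the 1911 Roget's Thesaurus numbering (codes invented).
SEMANTIC_LEXICON = {
    "sanction":  [(924, "N", 1), (760, "V", 2)],
    "economics": [(780, "N", 1)],
    "embargo":   [(761, "N", 1)],
    "boycott":   [(764, "V", 3)],
}

def analyze_topic(topic_text):
    """Return keyword frequencies and the semantic categories found for each keyword."""
    words = [w.strip(".,;:()").lower() for w in topic_text.split()]
    frequencies = Counter(w for w in words if w)
    categories = {}
    for word in frequencies:
        # No stemming, mirroring the Topic 052 processing described above,
        # so an inflected form such as "sanctions" finds no lexicon entry.
        if word in SEMANTIC_LEXICON:
            categories[word] = SEMANTIC_LEXICON[word]
    return frequencies, categories

if __name__ == "__main__":
    freqs, cats = analyze_topic("economics embargo boycott sanctions")
    print(freqs)  # all keywords counted
    print(cats)   # "sanctions" absent because it was not located in the lexicon
```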
Connectionist Model Routing Experiments

Recent work suggests that significant improvements in retrieval performance will require a technique that, in some sense, "understands" the content of documents and queries and can be used to infer probable relationships between documents and queries [2]. In this view, information retrieval is an inference or evidential reasoning process in which we estimate the probability that a user's information need is met given a document as "evidence". The techniques required to support this kind of inference are similar to those used in expert systems that must reason with uncertain information.

Several probabilistically oriented inference network models have been developed using experimental document collections [5] during the past few years for information retrieval [15]. These models are generally characterized by an architecture with two layers corresponding to documents and index terms. The documents and index terms are connected by direct links. Initially, the prior probabilities of all root nodes (nodes with no predecessors) and the conditional probabilities of all non-root nodes (given all possible combinations of their direct predecessors) must be specified. A retrieval consists of one or more documents with the highest posterior probability for the given set of index terms (evidence) which represents a user's information need.

Over the last few years, the technique of automated inference using probabilistic inference networks has become popular within the AI probability and uncertainty community, particularly in the context of expert systems [6,7]. The most important constraint on the use of a probabilistic network is the fact that, in general, the computation of the exact posterior probabilities is NP-hard [1]. Thus it is unlikely that we could develop an efficient general-purpose algorithm which would work well for all kinds of inference networks. There are several alternatives, such as the use of approximation algorithms, heuristic algorithms, or special-case algorithms [9,10].

The experiments here concern an attempt at a heuristic probabilistic inference network approach based on an AI connectionist model. The connectionist model uses a competitive activation rule to find the most probable retrieval. The term competitive activation rule refers to a spreading activation method in which nodes actively compete for available activation in a network. An initial formulation of a competitive activation mechanism was previously studied on three two-layer, abstract networks for diagnostic problem solving [11,13].

The connectionist model proposed here consists of a two-layer network architecture. Document nodes and index term nodes corresponding to each layer are connected by links whose weights represent association strengths between nodes. These links are also viewed as channels for sending information between nodes. Figure 4 is a simple network consisting of two document nodes and three index term nodes. At each moment of time, each node