36 semantic categories becomes present in every document.
This causes semantic category weights to become very low
and useless within that approach.
As reported in [4], one way to solve this problem is to
break TREC documents into paragraphs. Another way to
solve the problem of long documents causing semantic
weights to be of little value is to have more semantic
categories. A large number of "semantic" categories can be
obtained, for example, by using the categories and/or
subcategories found in Roget's Thesaurus instead of the 36
semantic categories we have used. This may be a deviation
from database semantic modeling. In any case, it needs to be
examined.
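A minimal, hypothetical sketch (not the weighting used in our
system) can make the dilution problem concrete: with a simple
count-based category weight, a long document tends to hit nearly
every one of 36 broad categories, so all weights become small and
nearly uniform, whereas a much larger category set leaves most
weights at zero and keeps the nonzero ones informative. The lexicon
and document below are invented for illustration.

from collections import Counter

def category_weights(doc_tokens, lexicon, num_categories):
    # lexicon maps each word to the set of semantic category ids it evokes
    hits = Counter()
    for word in doc_tokens:
        for cat in lexicon.get(word, ()):
            hits[cat] += 1
    total = sum(hits.values()) or 1
    # fraction of the document's category hits that fall in each category;
    # a crude stand-in for the semantic category weighting discussed above
    return {cat: hits[cat] / total for cat in range(num_categories)}

# Toy illustration with invented words and category ids:
toy_lexicon = {"embargo": {5, 12}, "boycott": {5}, "trade": {3}}
print(category_weights(["trade", "embargo", "boycott"], toy_lexicon, 36))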
Consequently, for the experiments reported here, a
semantic lexicon was created based on all the word senses
found in the public domain 1911 version of Roget's The-
saurus. To provide an example, consider Topic 052 as shown
in Figure 2. Figure 3 indicates the keywords and frequency
information within Topic 052, along with the semantic
categories obtained from our extended lexicon for those
keywords. Note that stemming was not used for the pro-
cessing of Topic 052; so, some keywords in Topic 052 were
not located in our lexicon (e.g. sanctions).
The categories recorded in our extended semantic lexicon
use the category numbers found in the 1911 version of Roget's
Thesaurus. These numbers are then followed by a part-of-
speech code also found in the 1911 version of Roget's
Thesaurus. The number after the part-of-speech code
represents a sub-category, but this number does not appear
in the 1911 version of Roget's Thesaurus. That number was
created based on groupings of words within the thesaurus.
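As an illustration of how such lexicon entries might be stored and
consulted, the following minimal sketch records each word sense as a
(category number, part-of-speech code, sub-category) triple and
returns, for a list of topic keywords, their frequencies and any
senses found. The words and codes shown are hypothetical, not actual
entries from our lexicon; keywords absent from the lexicon (such as
the unstemmed form "sanctions") simply contribute no categories, as
described above for Topic 052.

from collections import defaultdict

# word -> list of (Roget category number, part-of-speech code, sub-category)
lexicon = {
    "sanction":  [(924, "V", 2), (760, "N", 1)],   # hypothetical codes
    "africa":    [(181, "N", 3)],
    "apartheid": [(777, "N", 4)],
}

def keyword_semantics(keywords, lexicon):
    # frequency of each keyword plus the lexicon senses located for it;
    # keywords not found in the lexicon contribute no semantic categories
    freq = defaultdict(int)
    for word in keywords:
        freq[word] += 1
    return {w: (n, lexicon.get(w, [])) for w, n in freq.items()}

print(keyword_semantics(["sanctions", "africa", "apartheid"], lexicon))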
<top>
<head> Tipster Topic Description
<num> Number: 052
<dom> Domain: International Economics
<title> Topic: South African Sanctions
<desc> Description:
Document discusses sanctions against South Africa
<narr> Narrative:
A relevant document will discuss any aspect of South African sanctions, such
as: sanctions declared/proposed by a country against the South African
government in response to its apartheid policy, or in response to pressure by
an individual, organization or another country; international sanctions against
Pretoria imposed by the United Nations; the effects of sanctions against S.
Africa; opposition to sanctions; or, compliance with sanctions by a company.
The document will identify the sanctions instituted or being considered, e.g.,
corporate disinvestment, trade ban, academic boycott, arms embargo.
<con> Concept(s):
1. sanctions, international sanctions, economic sanctions
2. corporate exodus, corporate disinvestment, stock divestiture, ban on new
investment, trade ban, import ban on South African diamonds, U.N. arms
embargo, curtailment of defense contracts, cutoff of nonmilitary goods,
academic boycott, reduction of cultural ties
3. apartheid, white domination, racism
4. anti-apartheid, black majority rule
5. Pretoria
<fac> Factor(s):
<nat> Nationality: South Africa
</fac>
<def> Definition(s):

Figure 2. Topic 052.
3. Connectionist Model Routing Experiments
Recent work suggests that significant improvements in
retrieval performance will require a technique that, in some
sense, "understands" the content of documents and queries
and can be used to infer probable relationships between
documents and queries [2]. In this view, information retrieval
is an inference or evidential reasoning process in which we
estimate the probability that a user's information need is met
given a document as "evidence". The techniques required to
support this kind of inference are similar to those used in
expert systems that must reason with uncertain information.
Several probabilistically[OCRerr]oriented inference network models
have been developed using experimental document collec-
tions [5] during the past few years for information retrieval
[15]. These models are generally characterized by an
architecture with two layers corresponding to documents and
index terms. The documents and index terms are connected
by direct links. Initially, the prior probabilities of all root
nodes (nodes with no predecessors) and the conditional
probabilities of all non-root nodes (given all possible
combinations of their direct predecessors) must be specified.
A retrieval consists of one or more documents with the highest
posterior probability for the given set of index terms (evi-
dences) which represent a user's information need.
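The following is a minimal sketch, not our implementation, of exact
inference in a tiny two-layer network of this kind: document nodes are
roots with prior probabilities, index term nodes are children with
conditional probability tables over every combination of their document
parents, and documents are ranked by their posterior probability given
the observed index terms. All numbers are invented for illustration, and
the brute-force enumeration shows exactly the exponential cost that
motivates the heuristic approach discussed next.

from itertools import product

priors = {"d1": 0.5, "d2": 0.5}                     # P(document = 1)

# P(term = 1 | d1, d2), indexed by the parents' binary states
cpt = {
    "t1": {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.9},
    "t2": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.8},
    "t3": {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.6, (1, 1): 0.9},
}

def posterior(evidence):
    # P(d = 1 | evidence) for each document, by enumerating all document
    # states (exponential in the number of document nodes in general)
    doc_names = list(priors)
    joint = {}
    for states in product((0, 1), repeat=len(doc_names)):
        assignment = dict(zip(doc_names, states))
        p = 1.0
        for d, s in assignment.items():
            p *= priors[d] if s else 1.0 - priors[d]
        for term, observed in evidence.items():
            p_t = cpt[term][states]
            p *= p_t if observed else 1.0 - p_t
        joint[states] = p
    z = sum(joint.values())
    return {d: sum(p for st, p in joint.items() if st[i]) / z
            for i, d in enumerate(doc_names)}

# Index terms t1 and t2 observed as the user's information need:
print(posterior({"t1": 1, "t2": 1}))   # rank documents by posterior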
Over the last few years, the technique of automated
inference using probabilistic inference networks has become
popular within the AI probability and uncertainty community,
particularly in the context of expert systems [6,7]. The most
important constraint on the use of a probabilistic network is
the fact that in general, the computation of the exact posterior
probabilities is NP-hard [1]. Thus it is unlikely that we could
develop an efficient general-purpose algorithm which would
work well for all kinds of inference networks. There are
several alternatives, such as the use of approximation algo-
rithms or heuristic algorithms, and creating special case
algorithms [9,10].
The experiments here concern an attempt at a heuristic
probabilistic inference network approach based on an AI
connectionist model. The connectionist model uses a com-
petitive activation rule to find the most probable retrieval.
The term competitive activation rule refers to a spreading
activation method in which nodes actively compete for
available activation in a network. An initial formulation of
a competitive activation mechanism was previously studied
on three two-layer, abstract networks for diagnostic problem
solving [11,13]. The connectionist model proposed here
consists of a two-layer network architecture. Document
nodes and index term nodes corresponding to each layer are
connected by links whose weights represent association
strengths between nodes. These links are also viewed as
channels for sending information between nodes. Figure 4
is a simple network consisting of two document nodes and
three index term nodes. At each moment of time, each node