NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
which are extremely useful in natural language processing tasks, e.g. the Subject Codes. The
Subject Codes are based on a classification scheme of 124 major fields and 250 sub-fields.
Subject Codes are manually assigned to words in LDOCE by the Longman lexicographers. There is
a potential problem, however, with the Subject Code assignments which becomes obvious when an
attempt is made to use them computationally. Namely, a particular word may function as more
than one part of speech, each word may also have more than one sense, and each of these
entries and/or senses may be assigned different Subject Codes. This is a slight variant of the
standard disambiguation problem, which has shown itself to be nearly intractable for most NLP
applications, but which needed to be handled successfully if DR-LINK was to produce correct
semantic SFC vectors.
We based our computational approach to successful disambiguation on a study of current
psycholinguistic research literature from which we concluded that there is no single theory that
can account for all the experimental results on human lexical disambiguation. We interpret
these results as suggesting that there are three potential sources of influence on the human
disambiguation process:
Local context - the sentence containing an ambiguous word restricts its
interpretation
Domain knowledge - the recognition that a text is concerned with a particular
domain activates only the senses appropriate to that domain
Frequency data - the frequency of each sense's general usage affects its
accessibility
We have computationally approximated these three knowledge sources in our disambiguator. We
consider the 'uniquely assigned' and 'high-frequency' SFCs of words within a single sentence as
providing the local context which suggests the correct SFC for an ambiguous word. The SFC
correlation matrix, which was generated by processing a corpus of 977 Wall Street Journal
(WSJ) articles containing 442,059 words, equates to the domain knowledge (WSJ topics) that
is called upon for disambiguation if the local context does not resolve the ambiguity. Finally,
the ordering of SFCs in LDOCE replicates the frequency-of-use criterion. We implement the
computational disambiguation process by moving in stages from the more local level to the most
global type of disambiguation, using these sources of information to guide the disambiguation
process. The work is unique in that it successfully combines large-scale statistical evidence
with the more commonly espoused local heuristics.
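The staged procedure described above can be sketched as follows. This is an illustrative reconstruction, not the DR-LINK implementation: the function and argument names are hypothetical, and the correlation matrix is reduced to a simple pairwise lookup.

```python
def disambiguate(word_sfcs, sentence_context, correlation):
    """Select one Subject Field Code (SFC) for an ambiguous word.

    A hypothetical sketch of the three-stage process:
      word_sfcs        : candidate SFCs in LDOCE order (most frequent first)
      sentence_context : SFCs of unambiguous / high-frequency words in the sentence
      correlation      : dict mapping (sfc_a, sfc_b) -> co-occurrence strength,
                         standing in for the corpus-derived correlation matrix
    """
    if len(word_sfcs) == 1:              # unambiguous word: nothing to resolve
        return word_sfcs[0]

    # Stage 1 (local context): prefer a candidate SFC that already appears
    # among the unambiguous words of the same sentence.
    for sfc in word_sfcs:
        if sfc in sentence_context:
            return sfc

    # Stage 2 (domain knowledge): score candidates by their correlation
    # with the sentence context and keep the best-supported one.
    if sentence_context:
        def support(sfc):
            return sum(correlation.get((sfc, c), 0.0) for c in sentence_context)
        best = max(word_sfcs, key=support)
        if support(best) > 0.0:
            return best

    # Stage 3 (frequency): fall back on LDOCE's sense ordering,
    # which lists the most frequent sense first.
    return word_sfcs[0]
```

Each stage fires only when the earlier, more local stages fail to resolve the ambiguity, mirroring the local-to-global ordering described in the text.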
We tested our SFC disambiguation procedures on a sample of twelve randomly selected WSJ
articles containing 66 sentences consisting of 1638 words which had SFCs in LDOCE. The
system implementation of the disambiguation procedures was run and a single SFC was selected
for each word. These SFCs were compared to the sense-selections made by an independent judge
who was instructed to read the sentences and the definitions of the senses of each word and then
to select that sense of the word which was most correct. The disambiguation implementation
selected the correct SFC 89% of the time.
Operationally, the SFCoder tags each word in a document with the appropriate, disambiguated
SFC. The within-document SFCs are then summed and normalized to produce a vector of the
SFCs representing that document. Topic statements are likewise represented as SFC vectors. In
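The summing and normalizing step can be sketched as below. The function name is hypothetical; it simply counts the disambiguated SFC tags of a document's words and rescales the counts so the vector components sum to one, yielding comparable vectors for documents and topic statements alike.

```python
from collections import Counter

def sfc_vector(tagged_sfcs):
    """Build a normalized SFC vector from a document's disambiguated SFC tags.

    tagged_sfcs : one SFC per word (output of the disambiguation step).
    Returns a dict mapping each SFC to its relative frequency in the document.
    """
    counts = Counter(tagged_sfcs)          # sum the within-document SFCs
    total = sum(counts.values())
    return {sfc: n / total for sfc, n in counts.items()}  # normalize
```

Because both documents and topic statements are reduced to the same normalized SFC representation, they can be compared directly in the matching stage.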