SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) TIPSTER Panel -- DR-LINK's Linguistic-Conceptual Approach to Document Detection chapter E. Liddy S. Myaeng National Institute of Standards and Technology Donna K. Harman

which are extremely useful in natural language processing tasks, e.g. the Subject Codes. The Subject Codes are based on a classification scheme of 124 major fields and 250 sub-fields, and are manually assigned to words in the Longman Dictionary of Contemporary English (LDOCE) by the Longman lexicographers. There is a potential problem, however, with the Subject Code assignments, which becomes obvious when an attempt is made to use them computationally: a particular word may function as more than one part of speech, each word may have more than one sense, and each of these entries and/or senses may be assigned different Subject Codes. This is a slight variant of the standard disambiguation problem, which has shown itself to be nearly intractable for most NLP applications, but which needed to be handled successfully if DR-LINK was to produce correct semantic SFC vectors.

We based our computational approach to disambiguation on a study of the current psycholinguistic research literature, from which we concluded that no single theory can account for all the experimental results on human lexical disambiguation. We interpret these results as suggesting that there are three potential sources of influence on the human disambiguation process:

Local context - the sentence containing the ambiguous word restricts the interpretation of ambiguous words.

Domain knowledge - the recognition that a text is concerned with a particular domain activates only the senses appropriate to that domain.

Frequency data - the frequency of each sense's general usage affects its accessibility.

We have computationally approximated these three knowledge sources in our disambiguator.
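The staged use of these three knowledge sources can be sketched as follows. This is a minimal illustration, not DR-LINK's actual implementation: the function name, the representation of the correlation matrix as a dictionary, and the tie-breaking details are all assumptions.

```python
def disambiguate_sfc(candidates, local_context_sfcs, correlation):
    """Pick one Subject Field Code (SFC) for an ambiguous word.

    Stages mirror the three knowledge sources:
      1. local context: prefer candidates already evidenced in the sentence
      2. domain knowledge: prefer the candidate most correlated with context
      3. frequency: fall back to LDOCE's sense ordering (list order here)

    candidates: list of SFCs for the word, in LDOCE (frequency) order
    local_context_sfcs: set of unambiguous SFCs from the same sentence
    correlation: dict mapping (sfc_a, sfc_b) pairs to a correlation score
    """
    if len(candidates) == 1:                      # unambiguous: nothing to do
        return candidates[0]
    # Stage 1: local context resolves the ambiguity outright
    in_context = [c for c in candidates if c in local_context_sfcs]
    if len(in_context) == 1:
        return in_context[0]
    pool = in_context or candidates
    # Stage 2: domain knowledge via the corpus-derived correlation matrix
    scores = {c: sum(correlation.get((c, ctx), 0.0)
                     for ctx in local_context_sfcs) for c in pool}
    best = max(scores.values())
    top = [c for c in pool if scores[c] == best]
    if len(top) == 1 and best > 0.0:
        return top[0]
    # Stage 3: frequency -- first-listed sense is the most frequent
    return pool[0]

# Hypothetical example: "EC" (economics) vs. "MD" (medical) in a
# business ("BZ") context; the correlation values are invented.
corr = {("EC", "BZ"): 0.8, ("MD", "BZ"): 0.1}
print(disambiguate_sfc(["MD", "EC"], {"BZ"}, corr))  # prints "EC"
```

The fall-through ordering matters: cheap, reliable local evidence is tried first, and the global statistical evidence is consulted only when the sentence itself does not settle the question.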
We consider the `uniquely assigned' and `high-frequency' SFCs of words within a single sentence as providing the local context that suggests the correct SFC for an ambiguous word. The SFC correlation matrix, which was generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words, equates to the domain knowledge (WSJ topics) that is called upon for disambiguation if the local context does not resolve the ambiguity. Finally, the ordering of SFCs in LDOCE replicates the frequency-of-use criterion. We implement the computational disambiguation process by moving in stages from the more local level to the most global type of disambiguation, using these sources of information to guide the process. The work is unique in that it successfully combines large-scale statistical evidence with the more commonly espoused local heuristics.

We tested our SFC disambiguation procedures on a sample of twelve randomly selected WSJ articles containing 66 sentences consisting of 1638 words which had SFCs in LDOCE. The system implementation of the disambiguation procedures was run and a single SFC was selected for each word. These SFCs were compared to the sense selections made by an independent judge who was instructed to read the sentences and the definitions of the senses of each word and then to select the sense of the word which was most correct. The disambiguation implementation selected the correct SFC 89% of the time.

Operationally, the SFCoder tags each word in a document with the appropriate, disambiguated SFC. The within-document SFCs are then summed and normalized to produce a vector of the SFCs representing that document. Topic statements are likewise represented as SFC vectors.
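The sum-and-normalize step can be sketched as follows. This is a small illustration under stated assumptions: the SFC tags are invented for the example, and L1 normalization (weights summing to 1.0) is assumed, since the paper does not specify the normalization scheme.

```python
from collections import Counter

def sfc_vector(tagged_words):
    """Build a normalized SFC vector from a document's disambiguated tags.

    tagged_words: list of (word, sfc) pairs, as a SFCoder-style tagger
    might produce. Returns a dict mapping each SFC to its weight, with
    the weights summing to 1.0 (an L1-normalized frequency vector).
    """
    counts = Counter(sfc for _, sfc in tagged_words)
    total = sum(counts.values())
    return {sfc: n / total for sfc, n in counts.items()}

# Hypothetical document with invented SFC tags
doc = [("interest", "EC"), ("rates", "EC"), ("bank", "BF"),
       ("loan", "BF"), ("hospital", "MD")]
vec = sfc_vector(doc)
print(vec)  # prints {'EC': 0.4, 'BF': 0.4, 'MD': 0.2}
```

Because topic statements are represented the same way, documents and topics end up in a common SFC vector space where they can be compared directly, e.g. by a similarity measure over the two vectors.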