NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- DR LINK's Linguistic-Conceptual Approach to Document Detection
E. Liddy
S. Myaeng
National Institute of Standards and Technology
Donna K. Harman
modules, the system description will also be organized according to these same divisions. The
documents and the topic statements are analyzed in basically similar ways by the system's
modules, with a few exceptions which will be detailed below within that module's description.
2.a. Pre-Processing
We have chosen to perform rather substantive pre-processing of the raw text that is received
from DARPA because much of our later processing is dependent on clean, well-demarcated text.
For example, we identify sub-headlines and embedded figures, and we identify and restore
correct sentence boundaries. We also identify multiple stories within a single document so that
each separate story can have its own SFC vector produced, representative of just that one story,
and so that accurate text structuring can be accomplished at the individual story level.
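The pre-processing steps above can be sketched as follows. This is a minimal illustration, not DR-LINK's actual rules: the story delimiter and the line-joining heuristic are assumptions introduced for the example.

```python
import re

def split_stories(document: str, delimiter: str = "\n---\n") -> list[str]:
    """Split a multi-story document into separate stories.
    The delimiter is a hypothetical stand-in for whatever
    story-boundary markup the raw feed actually carries."""
    return [s.strip() for s in document.split(delimiter) if s.strip()]

def restore_sentence_boundaries(text: str) -> str:
    """Rejoin lines broken mid-sentence: a line that does not end in
    sentence-final punctuation is assumed to continue on the next line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    out: list[str] = []
    for ln in lines:
        if out and not out[-1].endswith(('.', '!', '?', ':')):
            out[-1] = out[-1] + ' ' + ln  # continuation of previous sentence
        else:
            out.append(ln)
    return '\n'.join(out)
```

With stories cleaned this way, each one can be passed separately to the later modules so that its vector reflects only its own content.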
The text is then processed by the POST part-of-speech tagger (Meteer et al., 1991), loaned to us
by BBN, which stochastically attaches a part-of-speech tag to individual words. The part-of-
speech-tagged text is then fed into a bracketer, a deterministic finite-state automaton that adds
several different types of brackets for linguistic constituents (e.g. noun phrases, prepositional
phrases, clauses) essential for several tasks in our system.
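A deterministic bracketer of this kind can be sketched as a left-to-right scan over tagged tokens. The tag set and the noun-phrase pattern below (optional determiner, any adjectives, one or more nouns) are simplifying assumptions for illustration, not BBN POST's actual tag inventory or DR-LINK's grammar.

```python
def bracket_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs.
    Returns the token stream with [NP ... ] brackets added around
    maximal determiner-adjective-noun sequences."""
    out, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == 'DT':        # optional determiner
            j += 1
        while j < n and tagged[j][1] == 'JJ':     # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in ('NN', 'NNS', 'NNP'):  # head nouns
            k += 1
        if k > j:                                 # found at least one noun
            out.append('[NP ' + ' '.join(w for w, _ in tagged[i:k]) + ' ]')
            i = k
        else:                                     # no NP here; emit the word
            out.append(tagged[i][0])
            i += 1
    return ' '.join(out)
```

Because each token is consumed exactly once and the next state depends only on the current tag, the scan behaves as a deterministic finite-state automaton over the tag sequence.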
2.b. Subject Field Coder
The Subject Field Coder (SFCer) produces a summary-level semantic representation of a text's
contents that is usable either for ranking a large set of incoming documents for their broad
subject appropriateness to a standing query, or for dividing a database into clusters of
documents on the same topic. One important benefit of the SFC representation is that it
implicitly handles both the synonymy and polysemy problems which have plagued the use of NLP
in IR systems, because this representation is one level above the actual words in a text.
For example, Figure 2 presents a short WSJ article and a humanly readable version of the
normalized SFC vector which serves as the document's semantic summary representation.
A U.S. magistrate in Florida ordered Carlos Lehder Rivas, described as among the
world's leading cocaine traffickers, held without bond on 11 drug-smuggling
counts. Lehder, who was captured last week in Colombia and immediately extradited
to the U.S., pleaded innocent to the charges in federal court in Jacksonville.
LAW                .2667    [OCRerr]        .1333
BUSINESS           .1333    ECONOMICS       .0667
DRUGS              .1333    MILITARY        .0667
POLITICAL SCIENCE  .1333    OCCUPATIONS     .0667

Fig. 2: Sample WSJ document and its SFC representation
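The construction of such a normalized vector can be sketched as below. The word-to-code lexicon here is a hypothetical miniature stand-in for the LDOCE Subject Codes, and real SFC tagging must also disambiguate words carrying more than one code; this sketch assumes one code per word.

```python
from collections import Counter

# Hypothetical word -> subject-code lexicon; the real system draws its
# codes from LDOCE and handles ambiguous words, which this sketch does not.
SUBJECT_CODES = {
    'magistrate': 'LAW', 'bond': 'LAW', 'court': 'LAW', 'charges': 'LAW',
    'cocaine': 'DRUGS', 'drug': 'DRUGS',
    'traffickers': 'BUSINESS',
    'extradited': 'POLITICAL SCIENCE',
}

def sfc_vector(words):
    """Map each known word to its subject code and return a normalized
    vector whose weights sum to 1, in the spirit of Fig. 2."""
    codes = Counter(SUBJECT_CODES[w] for w in words if w in SUBJECT_CODES)
    total = sum(codes.values())
    return {code: round(count / total, 4) for code, count in codes.items()}
```

Because two documents about the same subject map to nearby code distributions regardless of their exact word choices, comparing vectors at this level sidesteps synonymy; conversely, an ambiguous word contributes only the code selected for it, which addresses polysemy.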
The SFCer uses the Subject Codes from Longman's Dictionary of Contemporary English (LDOCE)
to produce this semantic representation of a text's contents. The machine-readable tape of the
1987 edition of LDOCE contains 35,899 headwords and 53,838 senses, for an average of 1.499
senses per headword plus several fields of information not visible in the hard-copy version