SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
from the WSJ training database:
GTS (takeover) = 0.00145576
GTS (merge) = 0.00094518
GTS (buy-out) = 0.00272580
GTS (acquire) = 0.00057906
with
SIM (takeover, merge) = 0.190444
SIM (takeover, buy-out) = 0.157410
SIM (takeover, acquire) = 0.139497
SIM (merge, buy-out) = 0.133800
SIM (merge, acquire) = 0.263772
SIM (buy-out, acquire) = 0.109106
Therefore both takeover and buy-out can be used to specialize
merge or acquire. With this filter, the relationships between
takeover and buy-out and between merge and acquire are either
both discarded or accepted as synonymous. At this time we are
unable to tell synonymous or near-synonymous relationships from
those which are primarily complementary, e.g., man and woman.
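The comparison of GTS values above can be illustrated with a short sketch. It assumes, as a reading of the example rather than a statement of the filter itself, that the term with the higher GTS value is the more specific one. The function name `more_specific` is hypothetical, and no cutoff threshold is modeled because none is given here.

```python
from itertools import combinations

# GTS and SIM values quoted in the text (WSJ training database).
GTS = {
    "takeover": 0.00145576,
    "merge": 0.00094518,
    "buy-out": 0.00272580,
    "acquire": 0.00057906,
}
SIM = {
    frozenset(("takeover", "merge")): 0.190444,
    frozenset(("takeover", "buy-out")): 0.157410,
    frozenset(("takeover", "acquire")): 0.139497,
    frozenset(("merge", "buy-out")): 0.133800,
    frozenset(("merge", "acquire")): 0.263772,
    frozenset(("buy-out", "acquire")): 0.109106,
}


def more_specific(a, b):
    """Order a pair of terms as (specific, general).

    Assumption: the term with the higher GTS is the more specific one.
    """
    return (a, b) if GTS[a] > GTS[b] else (b, a)


if __name__ == "__main__":
    for a, b in combinations(GTS, 2):
        specific, general = more_specific(a, b)
        ratio = GTS[specific] / GTS[general]
        sim = SIM[frozenset((a, b))]
        print(f"{specific} -> {general}: SIM={sim:.6f}, GTS ratio={ratio:.2f}")
```

Under this ordering takeover and buy-out each rank above merge and acquire, which agrees with the statement that either can specialize merge or acquire. The sketch does not model how the filter treats the takeover/buy-out and merge/acquire pairs.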
In TREC-1 the impact of query expansion through
term similarities on the system's overall performance
was generally disappointing. For TREC-2 we have made
a number of changes to the term correlation model, but
again time limitations prevented us from properly testing
all options. Among the most important changes are:
(1) Exclusion of pairs obtained from SUBJECT-VERB
relations: we determined that these contexts are
generally of little use, as neither subject nor verb
subcategorizes well for the other. Moreover, we
observed that the presence of these pairs was the
source of many unwanted term associations.11
(2) Automatic pruning of low-content terms from the
queries: terms with low idf weights, and terms with
low information contribution weights that are
elements of compound terms, are removed from
queries before the database search. As we tuned
various cutoff thresholds, we noted that a
significant increase in both recall and precision
could be obtained.12
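The idf-based part of the pruning step in item (2) can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes the common definition idf = log(N / df), and the function name `prune_query`, the cutoff `min_idf`, and the toy document counts are all hypothetical. The information contribution criterion for elements of compound terms is not modeled.

```python
import math


def idf(doc_freq, n_docs):
    """Inverse document frequency, assuming the common form log(N / df)."""
    return math.log(n_docs / doc_freq)


def prune_query(terms, doc_freqs, n_docs, min_idf):
    """Keep only the query terms whose idf reaches the cutoff min_idf.

    min_idf is a hypothetical tuning parameter; the cutoffs actually used
    are not stated in the text.
    """
    return [t for t in terms if idf(doc_freqs[t], n_docs) >= min_idf]


if __name__ == "__main__":
    # Invented counts for illustration only.
    doc_freqs = {"company": 900, "takeover": 10, "merge": 20}
    query = ["company", "takeover", "merge"]
    print(prune_query(query, doc_freqs, n_docs=1000, min_idf=1.0))
```

With these invented counts the very frequent term "company" falls below the cutoff and is dropped, while the rarer terms are kept.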
11 Subject-Verb pairs were retained as compound terms, however.
12 The Information Contribution measure indicates the strength of
word pairings, and is defined as
$IC(x,[x,y]) = \frac{f_{x,y}}{n_x + d_x - 1}$, where $f_{x,y}$ is
the absolute frequency of pair [x,y] in the corpus, $n_x$ is the frequency of
term x at the head position, and $d_x$ is a dispersion parameter understood
as the number of distinct syntactic contexts in which term x is found.
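The Information Contribution measure defined in this footnote can be computed directly from three counts. The sketch below is illustrative only; the function and parameter names are hypothetical, and the counts in the example are invented.

```python
def information_contribution(pair_freq, head_freq, dispersion):
    """IC(x, [x,y]) = f_xy / (n_x + d_x - 1).

    pair_freq  : f_xy, absolute frequency of the pair [x,y] in the corpus
    head_freq  : n_x, frequency of term x at the head position
    dispersion : d_x, number of distinct syntactic contexts containing x
    """
    return pair_freq / (head_freq + dispersion - 1)


if __name__ == "__main__":
    # x occurs 20 times as head, in 6 distinct contexts, 5 times with y.
    print(information_contribution(5, 20, 6))
    # x occurs only with y: a single context, so IC reaches 1.
    print(information_contribution(7, 7, 1))
```

When x appears in only one context, the denominator reduces to n_x and the measure equals 1; a more widely dispersed x lowers the contribution of any single pairing.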
word          cluster
takeover      merge, buy-out, acquire, bid
benefit       compensate, aid, expense
capital       cash, fund, money
staff         personnel, employee, force
attract       lure, draw, woo
sensitive     crucial, difficult, critical
speculate     rumor, uncertainty, tension
president     director, executive, chairman
vice          deputy
outlook       forecast, prospect, trend
law           rule, policy, legislate, bill
earnings      profit, revenue, income
portfolio     asset, invest, loan
inflate       growth, demand, earnings
industry      business, company, market
growth        increase, rise, gain
firm          bank, concern, group, unit
environ       climate, condition, situation
debt          loan, secure, bond
lawyer        attorney
counsel       attorney, administrator, secretary
compute       machine, software, equipment
competitor    rival, competition, buyer
alliance      partnership, venture, consortium
big           large, major, huge, significant
fight         battle, attack, war, challenge
base          facile, source, reserve, support
shareholder   creditor, customer, client, investor, stockholder
Table 1. Selected clusters obtained from syntactic contexts, derived
from approx. 40 million words of WSJ text, with weighted Tanimoto
formula.
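For readers unfamiliar with the measure named in the caption, the sketch below shows one common weighted generalization of the Tanimoto (Jaccard) coefficient: the sum of per-context minimum weights divided by the sum of per-context maximum weights. Whether this matches the exact formula and weighting used to build Table 1 is an assumption; the function name and the toy context weights are hypothetical.

```python
def weighted_tanimoto(ctx_a, ctx_b):
    """Weighted Tanimoto similarity of two terms.

    ctx_a, ctx_b map a syntactic context (attribute) to a non-negative
    weight; a context missing from one term counts as weight 0.
    """
    contexts = set(ctx_a) | set(ctx_b)
    num = sum(min(ctx_a.get(c, 0.0), ctx_b.get(c, 0.0)) for c in contexts)
    den = sum(max(ctx_a.get(c, 0.0), ctx_b.get(c, 0.0)) for c in contexts)
    return num / den if den > 0 else 0.0


if __name__ == "__main__":
    # Invented context weights for illustration only.
    takeover = {"hostile": 2.0, "bid": 1.0}
    merge = {"hostile": 1.0, "bid": 1.0, "bank": 2.0}
    print(weighted_tanimoto(takeover, merge))
```

Terms that share all contexts with equal weights score 1, and terms with no shared contexts score 0, so clusters such as those in Table 1 can be formed by grouping each word with its highest-scoring neighbors.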