SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
On Expanding Query Vectors with Lexically Related Words
chapter
E. Voorhees
National Institute of Standards and Technology
D. K. Harman
<title> Topic: What Backing Does the National Rifle Association Have?
<desc> Description:
Document must describe or identify supporters of the National Rifle
Association (NRA), or its assets.
<narr> Narrative:
To be relevant, a document must describe or name individuals or organizations who are
members of the NRA, or who contribute money to it. A document is also relevant
if it quantifies the NRA's financial assets or identifies any other NRA holdings.
<con> Concept(s):
1. National Rifle Association, NRA
2. contributor, member, supporter
3. holdings, assets, finances
<syn>
{funds, finance, monetary resource, cash_in_hand, pecuniary_resource}
{supporter, protagonist, champion, admirer, booster}
{gun}
Figure 2: Topic 093 and the synonym sets selected for it.
plus a tag indicating the lexical relation through which
the stems are related to the original synset are then
appended to the original query terms.
As an example of the expansion process, consider the
synsets for swing shown in Figure 1. If the synset added
to the topic is the synset containing golf_stroke, and
any number of hyponym (child) links may be traversed,
then the stems of golf, stroke, swing, shot, slice, hook,
drive, putt, approach, chip, and pitch would be added
to the query vector. If hyponym chains are limited to
length one, then chip and pitch would not be added.
If the synset added to the topic is the one containing
swing meaning plaything and any link type may be followed
for one link, then the stems of swing, mechanical,
device, plaything, toy, playground, and trapeze would be
added to the query.
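The golf_stroke example can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the synset memberships and hyponym links below are a hypothetical toy hierarchy, chosen only to be consistent with the word lists given above; the names SYNSETS, HYPONYMS, and expand are illustrative, and stemming is omitted.

```python
from collections import deque

# Hypothetical toy hierarchy (an assumption, not WordNet data): synset
# membership is chosen so the output agrees with the example in the text.
SYNSETS = {
    "golf_stroke": ["golf_stroke", "golf_shot", "swing"],
    "slice": ["slice"],
    "hook": ["hook"],
    "drive": ["drive"],
    "putt": ["putt"],
    "approach": ["approach", "approach_shot"],
    "chip": ["chip", "chip_shot"],
    "pitch": ["pitch", "pitch_shot"],
}
# Assumed hyponym (child) links between the toy synsets.
HYPONYMS = {
    "golf_stroke": ["slice", "hook", "drive", "putt", "approach"],
    "approach": ["chip", "pitch"],
}


def expand(start, max_links=None):
    """Collect the words of `start` and of every synset reachable from it
    through hyponym links. max_links=None means chains of any length."""
    words = set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        synset, depth = queue.popleft()
        for member in SYNSETS[synset]:
            # Collocations contribute each of their component words.
            words.update(member.split("_"))
        if max_links is not None and depth >= max_links:
            continue  # chain length limit reached; do not descend further
        for child in HYPONYMS.get(synset, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return words


unlimited = expand("golf_stroke")
one_link = expand("golf_stroke", max_links=1)
```

Under these assumptions, unlimited holds the eleven words listed in the text, and one_link holds the same set without chip and pitch.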
Stems added through different lexical relations are
kept separate using the extended vector space model
introduced by Fox [3]. Each query vector is composed
of subvectors of different concept types (called ctypes),
where each ctype corresponds to a different lexical
relation. A query vector potentially has eleven ctypes:
one for original query terms, one for synonyms, and
one each for the other relation types contained within
the noun portion of WordNet (each half of a symmetric
relation has its own ctype). An original query term
that is a member of a synset selected for that query
appears in both of the respective ctypes. Similarly, a
word that is related to a synset through two different
relations appears in both ctypes.
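One way to picture an extended query is as a mapping from ctype name to the set of stems in that subvector. The sketch below is illustrative only: the function name build_extended_query, the ctype labels, and the sample terms are assumptions, and only membership is tracked (weights are ignored).

```python
from collections import defaultdict


def build_extended_query(original_terms, expansions):
    """Group stems into one subvector (here, a set) per ctype.

    `expansions` is an iterable of (stem, relation) pairs, where the
    relation tag names the lexical relation that produced the stem.
    """
    query = defaultdict(set)
    query["original"].update(original_terms)
    for stem, relation in expansions:
        # A stem is stored under every relation that yields it, so the
        # same stem may appear in several ctypes.
        query[relation].add(stem)
    return dict(query)


# Hypothetical input: "swing" is both an original term and a synonym.
query = build_extended_query(
    ["swing", "golf"],
    [("swing", "synonym"), ("stroke", "synonym"), ("putt", "hyponym")],
)
```

Here swing ends up in both the original and the synonym ctype, matching the rule that a term belonging to a selected synset appears in both.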
The similarity between a document vector D and an
extended query vector Q is computed as the weighted
sum of the similarities between D and each of the
query's subvectors:
$$\mathrm{sim}(D, Q) = \sum_{\text{ctype } i} \alpha_i \, (D \cdot Q_i)$$
where $\cdot$ denotes the inner product of two vectors, $Q_i$
is the $i$th subvector of $Q$, and $\alpha_i$, a real number,
reflects the importance of ctype $i$ relative to the other
ctypes. Terms in document vectors are weighted using
the lnc weights suggested by Buckley et al. [2]; that
is, the weight of a term is set to 1.0 + ln(tf), where tf is
the number of times the term occurs in the document,
and is then normalized by the square root of the sum
of the squares of the weights in the vector (cosine
normalization). Query terms are weighted using ltN: the
log term frequency factor above is multiplied by the
term's inverse document frequency, and the weights in
the ctype representing original query terms are normalized
by the cosine factor. Weights in additional ctypes
are normalized using the length computed for the original
terms' ctype. This normalization strategy allows
the original query term weights to be unaffected by the
expansion process and keeps the weights in each ctype
comparable with one another.
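The weighting and similarity computation described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the ctype labels, and all numeric values (term frequencies, idf values, and the per-ctype importance weights) are made up, and the idf values are supplied as input because their exact formula is not given here.

```python
import math


def lnc_weights(tf):
    """Document weights: 1 + ln(tf), followed by cosine normalization."""
    raw = {t: 1.0 + math.log(f) for t, f in tf.items() if f > 0}
    length = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / length for t, w in raw.items()}


def ltn_query_weights(ctype_tf, idf, original="original"):
    """Query weights: (1 + ln(tf)) * idf within each ctype. Every ctype is
    divided by the vector length of the original-term ctype, so expansion
    leaves the original term weights unchanged."""
    raw = {
        c: {t: (1.0 + math.log(f)) * idf.get(t, 0.0)
            for t, f in tf.items() if f > 0}
        for c, tf in ctype_tf.items()
    }
    length = math.sqrt(sum(w * w for w in raw[original].values())) or 1.0
    return {c: {t: w / length for t, w in vec.items()}
            for c, vec in raw.items()}


def inner(doc, sub):
    """Inner product of a document vector and one query subvector."""
    return sum(w * doc.get(t, 0.0) for t, w in sub.items())


def similarity(doc, query, alpha):
    """Weighted sum over ctypes of the document/subvector inner products."""
    return sum(alpha.get(c, 0.0) * inner(doc, sub)
               for c, sub in query.items())


# Hypothetical data: term frequencies, idf values, and alpha are made up.
doc = lnc_weights({"nra": 1, "gun": 1})
query = ltn_query_weights(
    {"original": {"nra": 1}, "synonym": {"gun": 1}},
    idf={"nra": 2.0, "gun": 1.0},
)
alpha = {"original": 1.0, "synonym": 0.5}
score = similarity(doc, query, alpha)
```

Because the synonym subvector is scaled by the length of the original subvector rather than its own, adding or removing expansion ctypes does not change the weights of the original query terms.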