NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

On Expanding Query Vectors with Lexically Related Words
E. Voorhees
National Institute of Standards and Technology
D. K. Harman

<title> Topic: What Backing Does the National Rifle Association Have?
<desc> Description: Document must describe or identify supporters of the National Rifle Association (NRA), or its assets.
<narr> Narrative: To be relevant, a document must describe or name individuals or organizations who are members of the NRA, or who contribute money to it. A document is also relevant if it quantifies the NRA's financial assets or identifies any other NRA holdings.
<con> Concept(s):
1. National Rifle Association, NRA
2. contributor, member, supporter
3. holdings, assets, finances
<syn> {funds, finance, monetary resource, cash_in_hand, pecuniary_resource}
{supporter, protagonist, champion, admirer, booster}
{gun}

Figure 2: Topic 093 and the synonym sets selected for it.

plus a tag indicating the lexical relation through which the stems are related to the original synset are then appended to the original query terms. As an example of the expansion process, consider the synsets for swing shown in Figure 1. If the synset added to the topic is the synset containing golf_stroke, and any number of hyponym (child) links may be traversed, then the stems of golf, stroke, swing, shot, slice, hook, drive, putt, approach, chip, and pitch would be added to the query vector. If hyponym chains are limited to length one, then chip and pitch would not be added. If the synset added to the topic is the one containing swing meaning plaything and any link type may be followed for one link, then the stems of swing, mechanical, device, plaything, toy, playground, and trapeze would be added to the query.
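The golf_stroke example can be sketched in a few lines of Python. This is only an illustration, not the system used in the experiments: the graph `TOY_SYNSETS`, the function names, and the `max_links` parameter are hypothetical, and the graph is a hand-built guess at the relevant part of Figure 1, chosen so that it reproduces the word lists given above. Stemming is omitted; collocations are simply split on underscores.

```python
from collections import deque

# Hypothetical toy fragment of the noun hierarchy (an assumption, not WordNet data).
TOY_SYNSETS = {
    "golf_stroke": {"words": ["golf_stroke", "golf_shot", "swing"],
                    "hyponym": ["slice", "hook", "drive", "putt", "approach"]},
    "slice": {"words": ["slice"]},
    "hook": {"words": ["hook"]},
    "drive": {"words": ["drive"]},
    "putt": {"words": ["putt"]},
    "approach": {"words": ["approach", "approach_shot"],
                 "hyponym": ["chip", "pitch"]},
    "chip": {"words": ["chip", "chip_shot"]},
    "pitch": {"words": ["pitch", "pitch_shot"]},
}


def split_collocation(word):
    """Split a collocation such as 'golf_stroke' into its component words."""
    return word.lower().split("_")


def expand(synset_id, relation="hyponym", max_links=None, synsets=TOY_SYNSETS):
    """Collect words reachable from a synset, tagged by lexical relation.

    Words of the selected synset are tagged 'synonym'; words reached by
    following `relation` links are tagged with the relation name.
    max_links=None means chains of any length may be traversed.
    """
    added = {"synonym": set(), relation: set()}
    seen = {synset_id}
    queue = deque([(synset_id, 0)])
    while queue:
        sid, depth = queue.popleft()
        tag = "synonym" if depth == 0 else relation
        for word in synsets[sid]["words"]:
            added[tag].update(split_collocation(word))
        if max_links is not None and depth >= max_links:
            continue  # chain-length limit reached; do not follow more links
        for child in synsets[sid].get(relation, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return added


unlimited = expand("golf_stroke")
one_link = expand("golf_stroke", max_links=1)
```

Under these assumptions, `unlimited` contains the eleven words listed in the text across its two tags, while `one_link` omits chip and pitch. Keeping the relation tag with each word is what later allows the added words to be placed in separate parts of the query vector.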
Stems added through different lexical relations are kept separate using the extended vector space model introduced by Fox [3]. Each query vector is composed of subvectors of different concept types (called ctypes), where each ctype corresponds to a different lexical relation. A query vector potentially has eleven ctypes: one for original query terms, one for synonyms, and one each for the other relation types contained within the noun portion of WordNet (each half of a symmetric relation has its own ctype). An original query term that is a member of a synset selected for that query appears in both of the respective ctypes. Similarly, a word that is related to a synset through two different relations appears in both ctypes.

The similarity between a document vector D and an extended query vector Q is computed as the weighted sum of the similarities between D and each of the query's subvectors:

$$\mathrm{sim}(D, Q) = \sum_{\text{ctype } i} \alpha_i \, (D \cdot Q_i)$$

where $\cdot$ denotes the inner product of two vectors, $Q_i$ is the ith subvector of Q, and $\alpha_i$, a real number, reflects the importance of ctype i relative to the other ctypes.

Terms in document vectors are weighted using the lnc weights suggested by Buckley et al. [2]; that is, the weight of a term is set to 1.0 + ln(tf), where tf is the number of times the term occurs in the document, and is then normalized by the square root of the sum of the squares of the weights in the vector (cosine normalization). Query terms are weighted using ltN: the log term frequency factor above is multiplied by the term's inverse document frequency, and the weights in the ctype representing original query terms are normalized by the cosine factor. Weights in additional ctypes are normalized using the length computed for the original terms' ctype. This normalization strategy leaves the original query term weights unaffected by the expansion process and keeps the weights in each ctype comparable with one another.
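The weighting and similarity computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, vectors are plain dictionaries, the idf values are supplied by the caller because the text does not give the idf formula, and the original ctype is assumed to be non-empty with nonzero weights.

```python
import math


def lnc_weights(term_counts):
    """Document weights: 1.0 + ln(tf), then cosine normalization."""
    raw = {t: 1.0 + math.log(tf) for t, tf in term_counts.items() if tf > 0}
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()}


def query_subvectors(ctype_counts, idf, original="original"):
    """Query weights per ctype: (1.0 + ln(tf)) * idf.

    Every ctype is divided by the vector length of the *original* ctype,
    so expansion never changes the original terms' weights.
    """
    raw = {
        ctype: {t: (1.0 + math.log(tf)) * idf[t] for t, tf in counts.items()}
        for ctype, counts in ctype_counts.items()
    }
    length = math.sqrt(sum(w * w for w in raw[original].values()))
    return {
        ctype: {t: w / length for t, w in vec.items()}
        for ctype, vec in raw.items()
    }


def inner(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def similarity(doc, query, alpha):
    """Weighted sum over ctypes of alpha_i * (D . Q_i)."""
    return sum(alpha.get(ctype, 0.0) * inner(doc, sub)
               for ctype, sub in query.items())


# Illustrative values only (term counts, idf, and alpha are made up).
doc = lnc_weights({"gun": 2, "nra": 1})
query = query_subvectors(
    {"original": {"nra": 1}, "synonym": {"gun": 1}},
    idf={"nra": 2.0, "gun": 1.0},
)
score = similarity(doc, query, alpha={"original": 1.0, "synonym": 0.5})
```

Setting the alpha of a ctype to zero removes that relation's contribution entirely, so the unexpanded query is recovered by giving weight only to the original ctype.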