NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Chapter: Recent Developments in Natural Language Text Retrieval
T. Strzalkowski, J. Carballo
National Institute of Standards and Technology
D. K. Harman

pharmaceutical, respectively):[6]

context     GEW     f(x1,y)     f(x2,y)
firm        0.58    9           22
industry    0.51    [OCRerr]4   56
sector      0.61    5           9
concern     0.50    130         115
analyst     0.62    23          8
division    0.53    36          28
giant       0.62    15          12

Note that while some of these weights are quite low (less than 0.6; GEW takes values between 0 and 1), thus indicating a low-importance context, the frequencies with which these contexts occurred with both terms were high and balanced on both sides (e.g., concern), thus adding to the strength of association. We are now considering additional thresholds to bar low-importance contexts from being used in similarity calculation. It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990).
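The text does not say what the additional thresholds would be. As a minimal sketch, assuming a single hypothetical cutoff `min_gew` on the context weight, barring low-importance contexts from the similarity calculation might look as follows. The rows are the values read from the context list above (the industry row is left out because one of its frequencies is illegible in this copy); the names `Context` and `filter_contexts` and the cutoff 0.55 are illustrative assumptions, not part of the original system.

```python
from typing import List, NamedTuple


class Context(NamedTuple):
    word: str    # shared context word
    gew: float   # context weight (GEW), between 0 and 1
    f_x1: int    # co-occurrence frequency with the first term
    f_x2: int    # co-occurrence frequency with the second term


# Values as read from the list in the text; "industry" is omitted
# because one of its frequencies is illegible.
CONTEXTS = [
    Context("firm", 0.58, 9, 22),
    Context("sector", 0.61, 5, 9),
    Context("concern", 0.50, 130, 115),
    Context("analyst", 0.62, 23, 8),
    Context("division", 0.53, 36, 28),
    Context("giant", 0.62, 15, 12),
]


def filter_contexts(contexts: List[Context], min_gew: float) -> List[Context]:
    """Keep only contexts whose GEW reaches the (hypothetical) cutoff."""
    return [c for c in contexts if c.gew >= min_gew]


kept = filter_contexts(CONTEXTS, min_gew=0.55)
print([c.word for c in kept])
```

With this cutoff the context concern is dropped even though its frequencies are high and balanced for both terms, which illustrates the trade-off noted above: a weight threshold alone ignores the frequency evidence.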
Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.[7]

QUERY EXPANSION

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations).[8] It follows that not all similarity relations will be equally useful in query expansion; for instance, complementary and antonymous relations like the one between Australian and Canadian, accept and reject, or even generalizations like from aerospace to industry may actually harm the system's performance, since we may end up retrieving many irrelevant documents.

[6] Other common contexts, such as company or market, have already been rejected because they were paired with too many different words (a high dispersion ratio, see note 12).
[7] Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).
[8] Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.
On the other hand, database search is likely to miss relevant documents if we overlook the fact that vice director can also be deputy director, or that takeover can also be merge, buy-out, or acquisition. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our experiments where a relatively crude filter has visibly increased retrieval precision.

In order to create an appropriate filter, we devised a global term specificity measure (GTS) which is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, GTS is similar to the standard inverted document frequency (idf) measure except that term frequency is measured over syntactic units rather than document-size units.[9] Terms with higher GTS values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:

\[
GTS(w) =
\begin{cases}
IC_L(w) \cdot IC_R(w) & \text{if both exist} \\
IC_R(w) & \text{if only } IC_R(w) \text{ exists} \\
IC_L(w) & \text{otherwise}
\end{cases}
\]

where (with n_w, d_w > 0):

\[
IC_L(w) = IC([w,\_]) = \frac{\text{[OCRerr]}}{d_w \, (n_w + d_w - 1)}
\qquad
IC_R(w) = IC([\_,w]) = \text{[OCRerr]}
\]

For any two terms w1 and w2, and a constant b > 1, if GTS(w2) > b * GTS(w1) then w2 is considered more specific than w1.
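The case analysis for GTS and the specificity test can be sketched in a few lines. This is only an illustration under stated assumptions: the two information-contribution values IC_L(w) and IC_R(w) are taken as precomputed inputs (`None` when a term has no contexts on that side), the function names `gts` and `is_more_specific` are hypothetical, and raising an error when neither value exists is a choice made here, since the text does not cover that case.

```python
from typing import Optional


def gts(ic_left: Optional[float], ic_right: Optional[float]) -> float:
    """Combine IC_L(w) and IC_R(w) following the piecewise definition."""
    if ic_left is not None and ic_right is not None:
        return ic_left * ic_right   # both exist
    if ic_right is not None:
        return ic_right             # only IC_R(w) exists
    if ic_left is not None:
        return ic_left              # otherwise
    # Assumption: a term with no contexts at all is treated as an error.
    raise ValueError("term has neither a left nor a right context value")


def is_more_specific(gts_w1: float, gts_w2: float, b: float) -> bool:
    """True if w2 counts as more specific than w1: GTS(w2) > b * GTS(w1)."""
    if b <= 1:
        raise ValueError("the constant b must be greater than 1")
    return gts_w2 > b * gts_w1


# Made-up values, chosen only to exercise each branch.
print(gts(0.5, 0.25))    # both sides present
print(gts(None, 0.25))   # right side only
print(gts(0.5, None))    # left side only
print(is_more_specific(0.01, 0.5, b=10.0))
```

Because the product of two values below 1 is smaller than either factor, terms with contexts on both sides are not directly comparable with one-sided terms on the same scale; this is consistent with the remark that the comparison is only meaningful between terms already known to be similar.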
In addition, if SIM_norm(w1, w2) = σ > θ, where θ is an empirically established threshold, then w2 can be added to the query containing term w1 with weight σ.[10] For example, the following were obtained

[9] We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.
[10] For TREC-2 we used θ = 0.2; b varied between 10 and 100.
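The expansion rule can be illustrated with a small sketch. It reads "in addition" as requiring both conditions: w2 must be more specific than w1 by the GTS test, and the normalized similarity must exceed the threshold. The function name `expand_query`, the data layout, and all numeric values in the example are hypothetical; keeping the highest weight when several query terms suggest the same candidate is likewise an assumption, as the text does not address that case.

```python
from typing import Dict, List, Tuple


def expand_query(
    query_terms: List[str],
    similar: Dict[str, List[Tuple[str, float]]],  # w1 -> [(w2, SIM_norm(w1, w2))]
    gts: Dict[str, float],                        # precomputed GTS per term
    b: float = 10.0,       # specificity constant, b > 1
    theta: float = 0.2,    # similarity threshold
) -> Dict[str, float]:
    """Return {added_term: weight} for terms that pass both tests."""
    added: Dict[str, float] = {}
    for w1 in query_terms:
        for w2, sim in similar.get(w1, []):
            if w2 in query_terms:
                continue  # already part of the query
            specific_enough = gts[w2] > b * gts[w1]
            similar_enough = sim > theta
            if specific_enough and similar_enough:
                # Assumption: keep the highest weight seen for w2.
                added[w2] = max(sim, added.get(w2, 0.0))
    return added


# Made-up similarities and GTS values for illustration only.
similar = {"takeover": [("acquisition", 0.35), ("merge", 0.15), ("industry", 0.4)]}
gts_values = {"takeover": 0.001, "acquisition": 0.02, "merge": 0.05, "industry": 0.0005}
print(expand_query(["takeover"], similar, gts_values))
```

In this made-up example, merge is rejected because its similarity falls below the threshold, and industry is rejected because its GTS marks it as more general than takeover, so only acquisition is added, with its similarity as the weight.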