SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
pharmaceutical, respectively):6
firm      GEW=0.58  f_x1,y=9          f_x2,y=22
industry  GEW=0.51  f_x1,y=[OCRerr]4  f_x2,y=56
sector    GEW=0.61  f_x1,y=5          f_x2,y=9
concern   GEW=0.50  f_x1,y=130        f_x2,y=115
analyst   GEW=0.62  f_x1,y=23         f_x2,y=8
division  GEW=0.53  f_x1,y=36         f_x2,y=28
giant     GEW=0.62  f_x1,y=15         f_x2,y=12
Note that while some of these weights are quite low (less
than 0.6; GEW takes values between 0 and 1), thus
indicating a low-importance context, the frequencies with
which these contexts occurred with both terms were high
and balanced on both sides (e.g., concern), thus adding to
the strength of association. We are now considering addi-
tional thresholds to bar low importance contexts from
being used in similarity calculation.
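A threshold of the kind under consideration could be sketched as follows. This is only an illustration: the function name filter_contexts and the cutoff of 0.6 are assumptions made for the example (0.6 is borrowed from the remark above about low weights), not values the system is reported to use. The GEW weights are those listed above.

```python
# GEW weights of the shared contexts listed above.
SHARED_CONTEXTS = {
    "firm": 0.58,
    "industry": 0.51,
    "sector": 0.61,
    "concern": 0.50,
    "analyst": 0.62,
    "division": 0.53,
    "giant": 0.62,
}


def filter_contexts(gew_by_context, min_gew):
    """Keep only contexts whose GEW weight is at least min_gew.

    min_gew is a hypothetical cutoff; the paper does not state one.
    """
    return {c: w for c, w in gew_by_context.items() if w >= min_gew}


kept = filter_contexts(SHARED_CONTEXTS, min_gew=0.6)
print(sorted(kept))
```

With this cutoff a high-frequency, balanced context such as concern would be dropped, which is the trade-off any such threshold has to weigh.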
It may be worth pointing out that the similarities
are calculated using term co-occurrences in syntactic
rather than in document-size contexts, the latter being the
usual practice in non-linguistic clustering (e.g., Sparck
Jones and Barber, 1971; Crouch, 1988; Lewis and Croft,
1990). Although the two methods of term clustering may
be considered mutually complementary in certain situa-
tions, we believe that more and stronger associations can
be obtained through syntactic-context clustering, given
a sufficient amount of data and a reasonably accurate
syntactic parser.7
QUERY EXPANSION
Similarity relations are used to expand user queries
with new terms, in an attempt to make the final search
query more comprehensive (adding synonyms) and/or
more pointed (adding specializations).8 It follows that not
all similarity relations will be equally useful in query
expansion: for instance, complementary and antonymous
relations like the one between Australian and Canadian,
accept and reject, or even generalizations like from
6 Other common contexts, such as company or market, have
already been rejected because they were paired with too many different
words (a high dispersion ratio, see note 12).
7 Non-syntactic contexts cross sentence boundaries with no fuss,
which is helpful with short, succinct documents (such as CACM
abstracts), but less so with longer texts; see also (Grishman et al., 1986).
8 Query expansion (in the sense considered here, though not quite
in the same way) has been used in information retrieval research before
(e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed
results. An alternative is to use term clusters to create new terms, "meta-
terms", and use them to index the database instead (e.g., Crouch, 1988;
Lewis and Croft, 1990). We found that the query expansion approach
gives the system more flexibility, for instance, by making room for
hypertext-style topic exploration via user feedback.
aerospace to industry may actually harm the system's
performance, since we may end up retrieving many
irrelevant documents. On the other hand, database search
is likely to miss relevant documents if we overlook the
fact that vice director can also be deputy director, or that
takeover can also be merge, buy-out, or acquisition. We
noted that an average set of similarities generated from a
text corpus contains about as many "good" relations
(synonymy, specialization) as "bad" relations (antonymy,
complementation, generalization), as seen from the query
expansion viewpoint. Therefore any attempt to separate
these two classes and to increase the proportion of
"good" relations should result in improved retrieval. This
has indeed been confirmed in our experiments where a
relatively crude filter has visibly increased retrieval
precision.
In order to create an appropriate filter, we devised
a global term specificity measure (GTS) which is calcu-
lated for each term across all contexts in which it occurs.
The general philosophy here is that a more specific
word/phrase would have a more limited use, i.e., a more
specific term would appear in fewer distinct contexts. In
this respect, GTS is similar to the standard inverted
document frequency (idf) measure except that term frequency
is measured over syntactic units rather than
document-size units.9 Terms with higher GTS values are generally
considered more specific, but the specificity comparison
is only meaningful for terms which are already known to
be similar. The new function is calculated according to
the following formula:
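\[
\mathrm{GTS}(w) =
\begin{cases}
\mathrm{ICL}(w) \cdot \mathrm{ICR}(w) & \text{if both exist} \\
\mathrm{ICR}(w) & \text{if only } \mathrm{ICR}(w) \text{ exists} \\
\mathrm{ICL}(w) & \text{otherwise}
\end{cases}
\]
where (with $n_w, d_w > 0$):
\[
\mathrm{ICL}(w) = \mathrm{IC}([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}
\]
\[
\mathrm{ICR}(w) = \mathrm{IC}([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}
\]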
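The GTS calculation, together with the specificity and similarity tests stated in the paragraph that follows, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's implementation:

- The function names (ic, gts, expansion_weight) and the parameter names delta and theta are labels chosen for this example.
- n_w and d_w are treated as counts supplied by the caller.
- The form IC = n_w / (d_w (n_w + d_w - 1)) is assumed.
- Both tests are assumed to be required before a term is added.

```python
from typing import Optional


def ic(n_w: int, d_w: int) -> float:
    """Assumed form of IC for one context side: n_w / (d_w * (n_w + d_w - 1))."""
    if n_w <= 0 or d_w <= 0:
        raise ValueError("n_w and d_w must both be positive")
    return n_w / (d_w * (n_w + d_w - 1))


def gts(icl: Optional[float], icr: Optional[float]) -> float:
    """Combine the left and right IC values into GTS(w).

    None marks a side that does not exist for the term.
    """
    if icl is not None and icr is not None:
        return icl * icr
    if icr is not None:
        return icr
    if icl is not None:
        return icl
    raise ValueError("GTS is undefined when neither ICL nor ICR exists")


def expansion_weight(gts_w1: float, gts_w2: float, sim_norm: float,
                     delta: float, theta: float) -> Optional[float]:
    """Return the weight for adding w2 to a query containing w1, or None.

    Assumed reading: w2 must be more specific than w1
    (GTS(w2) > delta * GTS(w1)) and the normalized similarity must exceed
    theta; the weight is then the similarity itself.
    """
    if gts_w2 > delta * gts_w1 and sim_norm > theta:
        return sim_norm
    return None


# Hypothetical counts, for illustration only.
general = gts(ic(400, 120), ic(350, 90))
specific = gts(ic(12, 3), None)
print(expansion_weight(general, specific, sim_norm=0.3, delta=10, theta=0.2))
```

The example values theta = 0.2 and delta = 10 follow footnote 10 as read here; the counts are invented solely to exercise the code.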
For any two terms w1 and w2, and a constant δ > 1, if
GTS(w2) > δ * GTS(w1) then w2 is considered more
specific than w1. In addition, if SIMnorm(w1,w2) = σ > θ,
where θ is an empirically established threshold, then w2
can be added to the query containing term w1 with
weight σ.10 For example, the following were obtained
9 We believe that measuring term specificity over document-size
contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case.
In particular, syntax-based contexts allow for processing texts without
any internal document structure.
10 For TREC-2 we used θ = 0.2; δ varied between 10 and 100.