SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
WORDIJ: A Word Pair Approach to Information Retrieval
chapter
J. Danowski
National Institute of Standards and Technology
Donna K. Harman
Stemming.
Tests were run with the training sets for three queries
selected at random: 2,26, and 49. For query 2 the
difference was zero. For query 26, the relevant documents
retrieved increased by 43%, while for query 49 there was a
73% improvement. Average improvement for the three
queries was 37% using stemming.
All Pairs.
Tests were run for three different queries to examine
effects of dropping single occurring word pairs from
documents. Queries 51, 71, and 78 were chosen at random.
Retrieval of relevant documents increased on average by
75%, with varied results across queries. Query 51 saw
relevant documents retrieved increase by 2.25 times, query
71 decreased performance by .93, and query 78 increased by
11 times, for an average of 1.75 times increase in
performance.
indirect Match Tests.
The training set of documents for query 51, about
Airbus subsidies, was used to test indirectness effects. One-
step indirectness was assessed, meaning that two query pair
words were not direcdy in the document, but were indirectly
connected through an intermediary word.
To illustrate, here are the query pairs including the
word, "aid," none of which have any direct matches in the
documents:
AID LOAN
AID TRADE
AID FINANCING
AID SUBSIDIES
AID ASSISTANCE
AID GOVERNMENT
Table 4 contains the direct (one-step) and indirect (tw[OCRerr]
step) links that "aid" had in the documents. The leftmost
pairs are direct links, while the rightmost words were
directly linked only to the second word of the direct pairs,
thus forming a two-step indirect link to the first word in the
pair. For example, "aid" is linked to "government" only
indirecfly through "Airbus." Also, "aid" is linked to
"subsidies" only indirectly through "Airbus." These two sets
of indirect links, aid-(Airbus)-subsidies and
aid-(Airbus)-government are meaningful in terms of the
content of the query, which generally concerns government
aid and subsidies to Airbus. If we had used only directly
matching pairs, we would have missed these two
conceptually meaningful sets of links. After identifying all
indirect pairs in documents matching query pairs in this way,
retrieval of relevant documents was 12% higher.
Shortest Paths.
WORDij does not restrict detection of indirect phrases
to these dual bi-gram cases. Rather, indirectness can be of n-
step lengths [Danowski and Martin 79, Van Rijsbergen 77].
For example, if there is an intermediate term between two
other terms not otherwise linked, then these two other terms
have an indirect step linkage of two. If the connection is only
through two intermediaries, then the indirect linkage is at
step three, and so on. Shortest path algorithms [Gabow &
Tarjan 89] find the best set of all direct and indirect links
connecting all nodes in a network. Here, this is all words in
the query.
We expect that indirectness at the two-step level may
contribute most to recall-precision effectiveness. At larger
numbers of steps the value of indirect information
diminishes. This is because at the extreme lengths, every
word is indirectly connected to every other word. This is
equivalent to a simple within-document cooccurrence of
words, such as in traditional approaches. It renders useless
the local cooccurrence constraints. Note also that stop word
removal from texts is necessary to represent higher degrees
of indirectness. When stop words are present, they increase
the connectivity of the word network.
Structural Equivalence and Meaning.
In network analysis, attention to the direct links in a
network is called a "cohesion" approach while examining the
degree of similarity in two-step links is called "structural
equivalence" [Burt 90]. Two nodes are structurally
equivalent to the extent that they share the same indirect
links, though they may not be directly linked themselves.
For example, if word A is linked to words C,D[OCRerr] and word B
is linked to words C,D,E, then although A and B are not
directly linked (i.e. show no cohesion), they are structurally
equivalent and maximally similar because they share the
same links.
Research in mathematical sociology and network
analysis has found that structural equivalence is usually
equal to or better than cohesion in accounting for system
behavior. In text analysis using words as nodes, two words
can be considered to share more meaning to the extent they
have overlapping two-step links. Therefore, structural
equivalence of words is meaning equivalence.
Latent Semantic lndexmg and Indirect Pairs.
It is interesting that another approach to indirectness is
Latent Semantic Indexing [OCRerr]SI) [[)eerwester et al. 90;
Dumais 92]. Instead of than using a network approach,
however, it uses an eigenvector model. Eigenvectors
represent the combined effects of direct and indirect
associations among elements in the matrix. "Latent" refers
134