SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
WORDIJ: A Word Pair Approach to Information Retrieval
chapter
J. Danowski
National Institute of Standards and Technology
Donna K. Harman
Table 4: Direct and indirect links to the word "aid"
FREQUENCY
aid airbus 1
fdp 1
back 1
jets 2
adams 1
board 1
crash 1
group 1
plans 1
airbus 1
boeing 2
family 1
german 1
member 1
planes 2
dispute 1
mandate 1
nations 2
partner 2
percent 1
program 2
provide 1
aircraft 1
european 1
products 1
projects 2
amendment 1
executive 1
industrie 9
initially 1
ministers 1
spokesman 2
structure 1
* subsidies 2
violating 1
consortium 5
* government 1
management 1
aid package 1
aid guarantee 1
aid consortium
1
level. Nevertheless, they are listed
to show the larger context of
identifying meaningful indirectness.
to indirect association patternS beloW the manifest or direct
level. Currently, elgenvector solutions to large matrices are
more computationally limited than shortest path network
solutions. There has heen more development of large scale,
parallel algorithms for shortest pall's, due to the practical
needs to aid routing of information in telecommunications
networks. Some work, however, suggests that there is a
mathematical equivalence between eigenvector and network
approaches to reducing matrices of associations to a simpler
underlying structure [Barnett & Richards 91].
Shortest Path Weighting.
Given a set of query word pairs and a list of all
documents that contain each word pair--hoth directly and
indirectly-- we can take all pairs of nodes and identify the
shortest path linking them in the network. These paths are
measured for length according to Euclidean distance in graph
terms. Such distance is a direct function of the minimum
number of link steps it requires to connect two nodes on their
geodesic. Directly linked nodes have a distance of one,
nodes linked through one common intermediary node have a
distance of two, elc. Documents are counted that were
"passed through" or "activated" as each step in the shortest
path is traversed. Shortest path algorithms can find these
indirect paths with large data sets provided parallel
algorithms and hardware are used. We are further
developing such experiments.
After IDF weighting, ranking, and selection of the best
words, network analysis is conducted on the word pairs they
form. The shortest paths linking every word in the set are
found, and the word centrality in the network is indexed via
the average of the minimum number of steps between that
word and all other words in the sel
Then, for each document, it is given a weight that is
based on the centrality of the words from the query it
contains. The retrieved documents found along the shortest
paths between all query pairs are counted and weighted by
their constituent word centrality. ank ordered
for each query. Documents are then rank[OCRerr]rdered for each
query.
airbus 1 CONCLUSION
* These are indirect links that
create pairs contained in the
query pairs: aid- (Airbus) -subsidies
and aid-(Airbus)-government. The
other indirect links are not
meaningful because they do not
relate to the query at the two-step
Results showed that even with unexpected limitations
due to the mid-project death of the lead research assistant,
Nainesh Khimasia, we succeeded in processing the entire
TREC collection and doing direct matching of query word
pairs to document word pairs. For 15% of the topics, our
results can be considered failures. Failure analysis suggests
that improvements in future research may result from:
135