SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
WORDIJ: A Word Pair Approach to Information Retrieval
chapter
J. Danowski
National Institute of Standards and Technology
Donna K. Harman
WORDIJ: A WORD-PAIR APPROACH TO INFORMATION RETRIEVAL
James A. Danowski
University of Illinois at Chicago
CONCEPTUAL MODEL
WORDij is a system based on a linkage or network
model for representing textual information. The
fundamental unit of analysis is the word pair, or bi-gram
phrase, rather than the individual term. WORDij also takes a
local approach to term cooccurrence. Systems such as
SMART historically used the entire document as the field
within which to define term cooccurrence. More recent
research has suggested that defining cooccurrence within
smaller text units such as paragraphs may be better [Salton &
Buckley 91]. WORDij is even more local in focus. It
defines cooccurrence of terms within three word positions
(after dropping stop words). In addition, WORDij uses
direct and indirect pair information to compute shortest paths
among words in retrieved documents. This counts both
direct and indirect matches between queries and documents.
Consider a query Q containing the phrase (t1, t3) and a
document D containing the phrases (t1, t2) and (t2, t3) but
not the phrase (t1, t3). Existing algorithms [Salton &
Buckley 91; Croft, Turtle & Lewis 91; Fagan 89] would not
consider the dependency between t1 and t3 as there is no
match for the phrase. However, tree dependency models
[van Rijsbergen 77; Yu, Buckley, Lam and Salton 83]
recognize such indirect dependencies and produce a formula
to compute the degree of dependency between t1 and t3.
The WORDij approach considers not only the direct phrases
but also indirect phrases.
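
The direct and indirect matching described above can be illustrated with a small sketch. It is not the WORDij implementation: it assumes that word pairs are treated as undirected links and that path length is counted in links, and the function names build_graph and path_length are hypothetical. A direct phrase match has length 1; the indirect match between t1 and t3 through t2 has length 2.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """Build an undirected adjacency map from (word, word) pairs.

    Treating pairs as undirected links is an assumption of this sketch.
    """
    graph = defaultdict(set)
    for first, second in pairs:
        graph[first].add(second)
        graph[second].add(first)
    return graph


def path_length(graph, source, target):
    """Return the number of links on a shortest path, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbor in graph[word]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Document D from the example: it holds (t1, t2) and (t2, t3) but not (t1, t3).
document_pairs = [("t1", "t2"), ("t2", "t3")]
graph = build_graph(document_pairs)

print(path_length(graph, "t1", "t2"))  # 1: direct match
print(path_length(graph, "t1", "t3"))  # 2: indirect match through t2
```

A scoring rule could then weight a query pair by the inverse of this path length, but the source does not specify such a formula, so none is shown here.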
METHODS
TREC work was begun using a network of Sun
workstations in the Database and Information Systems
Laboratory in the Electrical Engineering and Computer
Science Department at the University of Illinois at Chicago.
Because the lead Research Assistant, Nainesh Khimasia,
died during the project, software development using C and
Unix tools was impeded. Earlier generations of tools had
been optimized for an IBM mainframe computer, so work
was switched to that platform. The machine used was an
IBM 3090/300J platform running VMXA, CMS. A virtual
machine CPU size of 16 megabytes was used along with three
gigabytes of disk space. The CPU clock speed is rated at
14.5 nanoseconds, or 69 MHz.
We modified earlier generations of WORDij software
written in SPITBOL [Danowski 82, Danowski & Andrews
85]. These modifications consisted mainly of replacing
some SPITBOL code where possible with CMS PIPELINE
code, because it runs approximately one thousand times
faster. The .Z text files were uncompressed using a
compress utility on CMS that works with Unix based
compressed files. WORDij code was run on each
uncompressed text file, generating an inverted file of word
pairs by document identification numbers. All word pairs
occurring only once in each document were dropped to save
disk space.
No spell checking, stemming, morphological analysis,
parsing, or tokenizing was done. A stop list of 631 words
was used, comprising the 570 stop words in SMART v.10
and some additional stop words forming the markup format
of the raw text. Processing time to create the word pair
index averaged three minutes per file.
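
The indexing steps above can be sketched as follows. This is a minimal illustration, not the SPITBOL or CMS PIPELINE code: it assumes that "within three word positions" means each remaining word is paired with the next three remaining words in order, it splits on whitespace only, and STOP_WORDS is a hypothetical stand-in for the 631-word stop list. The names word_pairs and build_inverted_file are also hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the 631-word stop list described in the text.
STOP_WORDS = {"the", "of", "a", "to", "in", "and"}
WINDOW = 3  # assumed reading of "within three word positions"


def word_pairs(text, window=WINDOW):
    """Yield ordered word pairs that co-occur within `window` positions.

    Stop words are dropped first, so positions are counted over the
    remaining words. Splitting on whitespace is an assumption.
    """
    words = [w for w in text.split() if w not in STOP_WORDS]
    for i, first in enumerate(words):
        for second in words[i + 1 : i + 1 + window]:
            yield (first, second)


def build_inverted_file(documents, min_count=2):
    """Map each word pair to {document id: frequency}.

    Pairs occurring fewer than `min_count` times in a document are dropped,
    mirroring the removal of pairs that occur only once.
    """
    inverted = defaultdict(dict)
    for doc_id, text in documents.items():
        counts = Counter(word_pairs(text))
        for pair, n in counts.items():
            if n >= min_count:
                inverted[pair][doc_id] = n
    return inverted


# Made-up documents for illustration only.
docs = {
    "D1": "word pair retrieval uses the word pair index",
    "D2": "retrieval of a document",
}
index = build_inverted_file(docs)
print(dict(index))  # {('word', 'pair'): {'D1': 2}}
```

For queries, the same function could be called with min_count=1 so that pairs occurring only once are kept.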
Ad hoc queries were automatically processed in the
same way as raw documents, except that pairs occurring only
once were not dropped. Query text used to generate word pairs for
matching included all text provided, except the factors and
definitions, and concepts numbered higher than two. CPU
time to build a query averaged 0.26 seconds. No further
processing, automatic or manual, was applied to the ad hoc
queries.
For the routing topics, queries were also constructed
automatically, but in a different way. The training sets of
relevant and irrelevant documents were separately analyzed
to identify all word pairs that occurred in the relevant set but
not in the irrelevant set. These unique relevant word pairs
were used as routing queries.
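
The routing query construction amounts to a set difference over word pairs. The sketch below assumes each training document has already been reduced to a set of word pairs; the function name routing_query and the sample pairs are hypothetical.

```python
def routing_query(relevant_docs, irrelevant_docs):
    """Return word pairs found in the relevant set but not in the irrelevant set.

    Each argument is a list of per-document word-pair sets.
    """
    relevant_pairs = set().union(*relevant_docs)
    irrelevant_pairs = set().union(*irrelevant_docs)
    return relevant_pairs - irrelevant_pairs


# Made-up training data for a single routing topic.
relevant_docs = [
    {("oil", "spill"), ("tanker", "oil")},
    {("oil", "spill"), ("spill", "cleanup")},
]
irrelevant_docs = [
    {("oil", "price"), ("tanker", "oil")},
]

query = routing_query(relevant_docs, irrelevant_docs)
print(sorted(query))  # [('oil', 'spill'), ('spill', 'cleanup')]
```

The pair ("tanker", "oil") is excluded because it also appears in an irrelevant document.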