NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
RECENT DEVELOPMENTS IN NATURAL LANGUAGE TEXT RETRIEVAL
Tomek Strzalkowski and Jose Perez Carballo
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu
ABSTRACT
This paper reports on some recent developments in our
natural language text retrieval system. The system uses
advanced natural language processing techniques to
enhance the effectiveness of term-based document
retrieval. The backbone of our system is a traditional sta-
tistical engine which builds inverted index files from
pre-processed documents, and then searches and ranks
the documents in response to user queries. Natural
language processing is used to (1) preprocess the docu-
ments in order to extract content-carrying terms, (2) dis-
cover inter-term dependencies and build a conceptual
hierarchy specific to the database domain, and (3) pro-
cess users' natural language requests into effective
search queries. For the present TREC-2 effort, a total
of 550 MBytes of Wall Street Journal articles (ad-hoc
queries database) and 300 MBytes of San Jose Mercury
articles (routing data) have been processed. In terms of
text quantity this represents approximately 130 million
words of English. Unlike in TREC-1, we were able to
create a single compound index for each database, and
therefore avoid merging of results. While the general
design of the system has not changed since the TREC-1
conference, we nonetheless replaced several components
and added a number of new features which are described
in the present paper.
INTRODUCTION
A typical information retrieval (IR) task is to select
documents from a database in response to a user's query,
and rank these documents according to relevance. This
has usually been accomplished using statistical methods
(often coupled with manual encoding) that (a) select
terms (words, phrases, and other units) from documents
that are deemed to best represent their content, and (b)
create an inverted index file (or files) that provides
easy access to documents containing these terms. A sub-
sequent search process will attempt to match a prepro-
cessed user query (or queries) against term-based
representations of documents, in each case determining a
degree of relevance between the two which depends
upon the number and types of matching terms. Although
many sophisticated search and matching methods are
available, the crucial problem remains that of an
adequate representation of content for both the docu-
ments and the queries.
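The index-then-match cycle described above can be sketched as follows. This is a minimal illustration, not the system described in this paper: the toy documents, the whitespace tokenizer, and the simple overlap-count ranking are all hypothetical stand-ins for the actual term selection and weighting components.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def rank(query, index):
    """Rank documents by the number of query terms they match."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: scores[d], reverse=True)

docs = {1: "joint venture in tokyo",
        2: "venture capital fund",
        3: "tokyo stock exchange"}
index = build_inverted_index(docs)
print(rank("joint venture", index))  # → [1, 2]: doc 1 matches both terms
```

A real engine would replace the raw overlap count with weighted scoring over the selected terms, but the inverted file gives the same constant-time path from term to posting list.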
The simplest word-based representations of con-
tent are usually inadequate since single words are rarely
specific enough for accurate discrimination, and their
grouping is often accidental. A better method is to iden-
tify groups of words that create meaningful phrases,
especially if these phrases denote important concepts in
the database domain. For example, joint venture is an
important term in the Wall Street Journal (WSJ henceforth)
database, while neither joint nor venture is important by
itself. In the retrieval experiments with the training
TREC database, we noticed that both joint and venture
were dropped from the list of terms by the system
because their idf (inverted document frequency) weights
were too low. In large databases, such as TIPSTER, the
use of phrasal terms is not just desirable, it becomes
necessary.
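The effect noted above can be illustrated with the standard idf formula, idf(t) = log(N / df(t)), where N is the collection size and df(t) the number of documents containing t. The tiny collection and the numbers below are illustrative only, not the actual WSJ statistics:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df), where df is the
    number of documents in which the term occurs."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

docs = [
    {"joint", "venture", "agreement"},
    {"joint", "statement"},
    {"venture", "capital"},
    {"joint", "venture"},
]
# Each single word occurs in most documents, so its idf is low and it
# risks falling below a weight cutoff; the phrase "joint venture",
# indexed as one term, occurs in fewer documents and stays discriminating.
phrase_df = sum(1 for d in docs if {"joint", "venture"} <= d)
print(round(idf("joint", docs), 2))               # → 0.29, low weight
print(round(math.log(len(docs) / phrase_df), 2))  # → 0.69, higher weight
```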
An accurate syntactic analysis is an essential prere-
quisite for selection of phrasal terms. Various statistical
methods, e.g., based on word co-occurrences and mutual
information, as well as partial parsing techniques, are
prone to high error rates (sometimes as high as 50%),
turning out many unwanted associations. Therefore a
good, fast parser is necessary, but it is by no means
sufficient. While syntactic phrases are often better indi-
cators of content than `statistical phrases' -- where words
are grouped solely on the basis of physical proximity
(e.g., "college junior" is not the same as "junior college")
-- the creation of compound terms makes the term-matching
process more complex, since in addition to the usual
problems of synonymy and subsumption, one must deal
with their structure (e.g., "college junior" is the same as
"junior in college"). In order to deal with structure,
parser's output needs to be "normalized" or "regularized"
so that complex terms with the same or closely related
meanings would indeed receive matching representa-
tions. This goal has been achieved to a certain extent in
the present work. As will be discussed in more detail
below, indexing terms were selected from among head-
modifier pairs extracted from predicate-argument