NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
Edited by Donna K. Harman, National Institute of Standards and Technology
NATURAL LANGUAGE PROCESSING IN LARGE-SCALE TEXT RETRIEVAL TASKS
Tomek Strzalkowski
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu
ABSTRACT
We developed a prototype text retrieval system
which uses advanced natural language processing
techniques to enhance the effectiveness of key-word
based document retrieval. The backbone of our system
is a traditional statistical engine which builds
inverted index files from pre-processed documents,
and then searches and ranks the documents in
response to user queries. Natural language processing
is used to (1) preprocess the documents in order
to extract content-carrying terms, (2) discover inter-term
dependencies and build a conceptual hierarchy
specific to the database domain, and (3) process
users' natural language requests into effective search
queries. For the present TREC effort, a total of 500
MBytes of Wall Street Journal articles was
processed in two batches of 250 MBytes each. Due to
time and space limits, a separate inverted index
was produced for each half of the data, with a partial
concept hierarchy built from the first 250 MBytes
only but used for retrieval on either half. Retrieval
was performed independently on both halves of the
database and the partial results were merged to produce
the final rankings.
INTRODUCTION
A typical information retrieval (IR) task is to
select documents from a database in response to a
user's query, and rank these documents according to
relevance. This has usually been accomplished using
statistical methods (often coupled with manual
encoding) that (a) select terms (words, phrases, and
other units) from documents that are deemed to best
represent their contents, and (b) create an inverted
index file (or files) that provides easy access to
documents containing these terms. A subsequent
search process will attempt to match a preprocessed
user query (or queries) against term-based representations
of documents, in each case determining a
degree of relevance between the two which depends
upon the number and types of matching terms.
Although many sophisticated search and matching
methods are available, the crucial problem remains
that of an adequate representation of contents for
both the documents and the queries.
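The statistical backbone described above can be sketched in a few lines.
The following is a minimal illustration, not the authors' actual system:
the document contents are invented, and scoring by a simple count of
matching query terms stands in for the weighted ranking schemes such
engines actually use.

```python
from collections import defaultdict

# Toy pre-processed documents: each maps a document id to its
# extracted terms (illustrative data, not from the WSJ collection).
docs = {
    "d1": ["joint", "venture", "japan", "firm"],
    "d2": ["stock", "market", "venture", "capital"],
    "d3": ["joint", "venture", "stock"],
}

# Inverted index: term -> set of documents containing that term.
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

def rank(query_terms):
    """Score each document by its count of matching query terms,
    highest-scoring first."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank(["joint", "venture"]))
```

Documents d1 and d3 contain both query terms and are ranked ahead of
d2, which matches only one.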
The simplest word-based representations of
contents are usually inadequate since single words
are rarely specific enough for accurate discrimination,
and their grouping is often accidental. A better
method is to identify groups of words that create
meaningful phrases, especially if these phrases
denote important concepts in the database domain. For
example, joint venture is an important term in the Wall
Street Journal (WSJ henceforth) database, while neither
joint nor venture is important by itself. In the
retrieval experiments with the training TREC database,
we noticed that both joint and venture were
dropped from the list of terms by the system because
their idf (inverse document frequency) weights were
too low. In large databases, such as TIPSTER, the
use of phrasal terms is not just desirable, it becomes
necessary.
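The effect described above follows directly from the standard idf
formula, idf(t) = log(N / df(t)), where N is the collection size and
df(t) the number of documents containing term t. A small sketch with
assumed (hypothetical) document frequencies shows why the single words
score poorly while the phrase retains a discriminating weight:

```python
import math

# Assumed collection size and document frequencies; these numbers are
# illustrative, not measured from the WSJ database.
N = 1_000_000
df = {"joint": 400_000, "venture": 250_000, "joint venture": 9_000}

def idf(term):
    """Inverse document frequency: log(N / df(term))."""
    return math.log(N / df[term])

for term in df:
    print(f"{term:15s} idf = {idf(term):.2f}")
```

Frequent words like joint and venture receive low idf weights and may
fall below a term-selection threshold, whereas the much rarer phrase
joint venture keeps a weight high enough to be retained as a term.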
The question thus becomes: how do we identify the
correct phrases in the text? Both statistical and syntactic
methods have been used before with only limited
success. Statistical methods based on word co-occurrences
and mutual information are prone to high
error rates, turning out many unwanted associations.
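The failure mode of such co-occurrence statistics can be illustrated
with pointwise mutual information, I(x, y) = log(p(x, y) / (p(x) p(y))).
The counts below are invented for illustration: a single chance
co-occurrence of two rare words outscores a well-attested collocation,
which is exactly the kind of unwanted association noted above.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts:
    log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

total = 1_000_000
# A genuine collocation with substantial evidence ...
strong = pmi(8_000, 40_000, 25_000, total)
# ... versus one accidental pairing of two very rare words.
spurious = pmi(1, 10, 10, total)
print(f"strong collocation: {strong:.2f}, rare accident: {spurious:.2f}")
```

Because PMI divides by the product of small marginal probabilities, low-
frequency pairs dominate the ranking unless frequency cutoffs or other
corrections are applied.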
Syntactic methods suffered from the low quality of
generated parse structures, which could be attributed
to limited-coverage grammars and the lack of adequate
lexicons. In fact, the difficulties encountered in applying
computational linguistics technologies to text processing
have contributed to a widespread belief that
automated natural language processing may not be
suitable in IR. These difficulties included
inefficiency, lack of robustness, and the prohibitive cost
of the manual effort required to build lexicons and
knowledge bases for each new text domain. On the
other hand, while numerous experiments did not
establish the usefulness of linguistic methods in IR,
they cannot be considered conclusive because of their