NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
cutoffs could be different for synonymy and specialization. For example, the following were obtained from the TREC WSJ training database:

ICW(child) = 0.000001
ICW(baby) = 0.000013
ICW(infant) = 0.000055
SIM(child, infant) = 0.131381
SIM(baby, child) = 0.183064
SIM(baby, infant) = 0.323121
Therefore both baby and infant can be used to specialize child (with θ = 10), while baby and infant can be considered synonyms (with θ = 5). Note that if θ is well chosen, then the above filter will also help to reject antonymous and complementary relations, such as SIMnorm(back, front) = 0.305287 with ICW(back) = 0.000001 and ICW(front) = 0.000006.
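The filter just described can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the authors' implementation: the function name, the similarity cutoff, and the use of an ICW ratio as the specificity test are assumptions based on the example values given in the text.

```python
# Hypothetical sketch of the specialization/synonymy filter described above.
# Thresholds and the ICW-ratio test are assumptions, not the paper's code.

def classify_pair(icw, sim_value, x, y, sim_cutoff=0.1,
                  theta_syn=5, theta_spec=10):
    """Classify a term pair (x, y) using the ratio of their ICW
    (term specificity) weights, once their similarity passes a cutoff."""
    if sim_value < sim_cutoff:
        return "rejected"                     # not similar enough to consider
    lo, hi = sorted((icw[x], icw[y]))
    ratio = hi / lo
    if ratio <= theta_syn:
        return "synonyms"                     # comparable specificity
    if ratio >= theta_spec:
        narrow = x if icw[x] > icw[y] else y  # higher ICW = more specific term
        broad = y if narrow is x else x
        return narrow + " specializes " + broad
    return "rejected"                         # ambiguous specificity ratio

# Example values from the text:
icw = {"child": 0.000001, "baby": 0.000013, "infant": 0.000055}

print(classify_pair(icw, 0.323121, "baby", "infant"))   # ratio ~4.2 -> synonyms
print(classify_pair(icw, 0.131381, "child", "infant"))  # ratio 55 -> specialization
```

With these values the baby/infant ICW ratio is about 4.2 (within θ = 5, hence synonyms), while the child/infant ratio is 55 (beyond θ = 10, hence specialization), matching the discussion above.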
We continue working to develop more effective filters. Examples of filtered similarity relations obtained from the TREC corpus are listed in Tables 2 and 3.
SUMMARY OF RESULTS
We have processed a total of 500 MBytes of articles from the Wall Street Journal section of the TREC database. Runs were made independently on the training corpus (250 MBytes) and the test corpus (another 250 MBytes). Natural language processing of each half of the corpus, including tagging, stemming, parsing and pair extraction, required approximately 2 weeks of nearly uninterrupted processing on two Sparc workstations (we used an assortment of SparcStation 1, ELC and 2, with MIPS rates varying from 13 to 21 to 28.5).22 Computing term similarities from the first 250 MBytes of text took up another week. This time, limitations in our computing resources, mostly memory, were the main reason. With a sufficient amount of RAM (on the order of 256 MBytes or more) we estimated that this process should take no more than 20 hours. It should be noted that we computed term similarities based only on the training corpus, but we used them in retrieval from either database. We assumed that the underlying set of concepts and the terms describing them does not vary greatly between the two databases (the first covered the Wall Street Journal from 1987 to 1989, the second from 1990 to 1992).
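The processing stages named above (tagging, stemming, parsing, pair extraction) form a simple pipeline. The sketch below is purely schematic: every stage is a toy stand-in, not the actual tagger, stemmer, or parser used in these experiments.

```python
# Schematic corpus-processing pipeline; each stage is a toy stand-in for the
# real component (the actual system used a full tagger, stemmer, and parser).

def tag(tokens):
    # part-of-speech tagging (stand-in: label everything a noun)
    return [(t, "NN") for t in tokens]

def stem(tagged):
    # suffix stripping (stand-in: drop a trailing "s")
    return [(w[:-1] if w.endswith("s") else w, pos) for w, pos in tagged]

def extract_pairs(stemmed):
    # head-modifier pair extraction (stand-in: adjacent-word pairs;
    # the real system extracted pairs from parse structures)
    words = [w for w, _ in stemmed]
    return list(zip(words, words[1:]))

def process(document):
    tokens = document.lower().split()
    return extract_pairs(stem(tag(tokens)))

print(process("Stock prices rose sharply"))
# -> [('stock', 'price'), ('price', 'rose'), ('rose', 'sharply')]
```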
22 Processing of the training corpus took significantly longer in practice (real time) since we had only one SparcStation available at any given time, and also due to errors made and the resulting need for re-processing some parts.
The subsequent indexing process performed by the NIST IR system required an additional 2 weeks for each half of the database, on a single SparcStation 2.23 In total, we created three inverted file indexes: two were derived from the training corpus, and one from the test corpus. The first index created from the training corpus did not include compound terms; the remaining two indexes included them. No cumulative index was produced, since we estimated it would require more than 6 weeks to build. Some interesting discrepancies were observed. While the postings file generated from the training corpus was 25% larger than the one generated from the test corpus, the corresponding dictionary of terms was about 25% smaller in the training corpus. This seems to suggest that in the 1990-92 volumes of the Wall Street Journal (test corpus) a greater number of unique terms were used, while these terms on average occurred more sparsely in the text. This fact may partially undermine an earlier assumption that term similarities generated from the training corpus were adequate for the test corpus as well.
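The dictionary-size versus postings-size comparison just described can be computed with a few lines of code. The sketch below uses toy document lists purely to show the two statistics side by side; the actual comparison was of course over the two 250-MByte WSJ collections.

```python
# Illustrative dictionary-size vs. postings-count comparison (toy corpora).

def index_stats(docs):
    """Return (dictionary size, postings count) for a document list,
    counting one posting per distinct (term, document) pair."""
    dictionary = set()
    postings = 0
    for doc in docs:
        terms = set(doc.lower().split())
        dictionary |= terms
        postings += len(terms)
    return len(dictionary), postings

# Toy stand-ins for the two corpora (illustrative only):
train = ["oil prices fell", "oil prices rose", "prices fell again"]
test = ["grain exports grew", "steel imports fell", "bond yields rose"]

print(index_stats(train))  # smaller dictionary, terms reused -> (5, 9)
print(index_stats(test))   # larger dictionary, sparser terms -> (9, 9)
```

A corpus with a larger dictionary but a smaller postings file is exactly the pattern observed above: more unique terms, each occurring in fewer documents.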
During the training stage of TREC we performed only very limited tests, mostly because the base system retrieval was quite slow on a database of this size (approx. 173 MBytes, 99+K documents).24 We ran experiments with topics 001 to 005, using relevance judgements supplied with the training database. The purpose of these tests was to (1) observe any improvement in performance resulting from the use of compound phrase terms, (2) tune the term similarity filter by adjusting certain threshold values in an attempt to obtain the most effective set of similarities, and (3) note if any manual intervention into the queries could improve retrieval results. As may be expected, none of these experiments were conclusive, but this was all we could rely upon in the short time given. First of all, we noticed that the system's overall performance (cumulative recall and precision levels) was below what we could have expected from runs on smaller benchmark collections such as CACM-3204. This may be partially explained by the diversity of the subject matter covered in the TREC data, and also by the character of the queries (i.e., requests to extract information), but some problem with the base search engine may also be to blame.25 We were
23 The indexing process could not be done in parallel because the NIST system requires serial processing. Moreover, as indexing progressed and the partial database grew in size, the process slowed down significantly, from approx. 6 min/file to 37 min/file.
24 The system needed approx. 60 minutes to process a query.
25 In the middle of the training run we discovered that the NIST system had been designed to handle databases of up to 65K documents (2^16). Subsequently we rewrote portions of the code to fix this, but we were not certain if other decisions made by the retrieval module (e.g., bitmapping, idf cutoffs) were not in fact coun-