NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
CREATING AN INDEX
Most of the problems we encountered when adapting NIST's PRISE system for use in TREC-2 had to do with the size of the data that had to be indexed. We had to deal with the restrictions imposed by the resources we had (e.g., a limited amount of virtual memory). The rest of this section describes some of the changes we made to the NIST system in order to deal with these restrictions. The original system would request twice the previously requested amount of memory each time it needed more. As a result, the system would reach the limit of virtual memory after only a relatively small portion of the total number of documents had been indexed. In our version, the memory requested by the system grows linearly. The increments are estimated in such a way that the system never requests too much memory.
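The difference between the two request policies can be sketched as follows. This is an illustrative simulation only, not the PRISE code; the function names, initial request size, and memory limit are all hypothetical:

```python
def doubling_requests(initial_mb, limit_mb):
    """Original policy: each request doubles the previous one.
    Returns (list of request sizes, total memory obtained)."""
    total, step, requests = 0, initial_mb, []
    while total + step <= limit_mb:
        total += step
        requests.append(step)
        step *= 2  # exponential growth quickly overshoots the limit
    return requests, total

def linear_requests(initial_mb, increment_mb, limit_mb):
    """Modified policy: each request grows by a fixed, estimated
    increment, so the final (refused) request wastes less headroom."""
    total, step, requests = 0, initial_mb, []
    while total + step <= limit_mb:
        total += step
        requests.append(step)
        step += increment_mb  # linear growth
    return requests, total

# With a 100 MB limit, doubling stalls once the next (doubled) request
# no longer fits, leaving much of the limit unused; linear growth gets
# considerably closer to the limit before its next request is refused.
```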
The indexing process became too fragile when the limits of the environment were approached. When a large portion of the virtual memory and of the disk space was being used by the indexing process, crashes became very likely. Unfortunately, it turned out that the process was very difficult to restart after some crashes (e.g., in the rebuild phase), thus leading to time-consuming repeats.
Indexing also takes too long at present. Given the size of the data to be indexed, the whole process takes at least 250 hours if everything goes well, which happens seldom. Given TREC-2's deadlines we could not afford to perform too many experiments: we barely had time to index the corpus once.
Most of these problems could be solved by distributing the indexing process to several different machines and performing the indexing in parallel.
We believe that it is possible to create several small indexes instead of a single very large one. If certain rules are followed when creating the distributed index, it should be possible to merge the results of querying the set of small indexes and to obtain a performance (recall and precision) comparable to the results obtained using a single index. The test setup we built in order to perform the experiments required for TREC-2 should allow us to test these hypotheses. The advantages of a distributed index are clear:
(1) The indexing process would be faster.
(2) Each one of the distributed indexing processes
would be smaller and less fragile.
(3) Even if one of the distributed processes crashes
restarting it would be less expensive.
(4) A distributed system would be much easier to
update, i.e., adding a new document would not
require reindexing the whole corpus.
(5) A distributed system would be more likely to be
useful for studying the kinds of problems and
solutions that are likely to be encountered in a
real-world situation.
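The merging step above can be sketched as follows. This is an illustrative Python sketch, not the PRISE implementation; it assumes each sub-index returns a result list already sorted by descending score, and that scores are directly comparable across sub-indexes (e.g., computed from global collection statistics):

```python
import heapq

def merge_results(result_lists, k=10):
    """Merge ranked result lists from several small indexes.

    Each element of result_lists is a list of (score, doc_id) pairs,
    sorted by descending score. Returns the top-k pairs overall.
    heapq.merge performs an efficient k-way merge of the sorted lists
    without concatenating and re-sorting them.
    """
    merged = heapq.merge(*result_lists, key=lambda pair: -pair[0])
    return list(merged)[:k]
```

With comparable scores, the merged top-k ranking is identical to what a single monolithic index would produce, which is the basis for expecting comparable recall and precision.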
SUMMARY OF RESULTS
We have processed a total of 850 MBytes of text during TREC-2. The first 550 MBytes were articles from the Wall Street Journal, which were previously processed
for TREC-1; we had to repeat most of the processing to correct early tokenization errors introduced by the tagger. The entire process (tagging, parsing, phrase extraction) took just over 4 weeks on two Sun SparcStations (models 1 and 2). Building a comprehensive index for the
WSJ database took up another 2 weeks. This time we
were able to create a single index thanks to the improved
indexing software received from NIST. The final index
size was 204 MBytes, and included 2,274,775 unique terms (of which about 310,000 were single-word terms, and the remaining 1,865,000 were syntactic word pairs) occurring in 173,219 documents, or more than 13 (unique) terms per document. Note that this gives potentially much better coverage of document content than single-word terms alone, with less than 2 unique terms per document. We say `potentially' since the process of deriving phrase-level terms from text is still only partially understood, including the complex problem of `normalization' of representation.
The remaining 300 MBytes were articles from the San Jose Mercury News, which were contained in TIPSTER disk-3. Processing of this part, and creating an
index for routing purposes took about 3 weeks. While
natural language processing required 2 weeks to complete (at approximately the same speed as for the WSJ database), we were able to cut indexing time in half by using
a faster in-memory version of the NIST system. This new
version reduces the time required by the first phase of
indexing from days to hours; however, the second phase remains slow (days) and fragile (we had to redo it 3
times). The final size of the SJMN index was 101
MBytes, with 1,535,971 unique terms occurring in
86,270 documents (nearly 18 unique terms per docu-
ment).13
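The per-document figures quoted above follow directly from the index statistics; a quick arithmetic check (all numbers taken from the text):

```python
# Index statistics quoted in the text for the two databases.
wsj_terms, wsj_docs = 2_274_775, 173_219
sjmn_terms, sjmn_docs = 1_535_971, 86_270

# WSJ: "more than 13 (unique) terms per document"
wsj_ratio = wsj_terms / wsj_docs    # about 13.1

# SJMN: "nearly 18 unique terms per document"
sjmn_ratio = sjmn_terms / sjmn_docs  # about 17.8
```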
Two types of retrieval have been done: (1) new topics 101-150 were run in the ad-hoc mode against the WSJ database, and (2) topics 51-100, previously used in TREC-1, were run in the routing mode against the SJMN database. In each category several runs were attempted
13 It has to be noted that the ratios at which new terms are generated are nearly identical in both databases; at 86,319 documents (or about halfway through the WSJ database) 1,335,622 unique terms had been recorded.