NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Multilevel Ranking in Large Text Collections Using FAIRS
chapter
S-C. Chang
H. Dediu
H. Azzam
M-W. Du
National Institute of Standards and Technology
Donna K. Harman
number 1, and assume the following settings for the vari-
able attributes:
FREQUENCY = neutral (s[1] = 0)
POPULARITY = neutral (s[2] = 0)
WORD_LOC = neutral (s[3] = 0)
REC_ID = neutral (s[4] = 0)
REC_SIZE = positive (s[5] = 1)
The contributing factors to the weight of a retrieved docu-
ment are IMPORTANCE and REC_SIZE. Because of its
large size (a[5] >> 0), record 1 will be irrelevantly
retrieved in response to a large percentage of pending que-
ries.
To avoid the above scenario, Method 2 uses the following
formula to compute the normalized weight of a retrieved
record r at level l:

    W_l[r] = SUM(i=1..K) e[i] * (1/C) * SUM(j=1..A) s[j] * a[j]/am[j]
Where:
e[i] = 1 if keyword i exists in the record and 0 other-
wise,
a[j] = the value of attribute j,
A = the number of attributes (currently 6),
K = the total number of query keywords,
s[j] = 1 if the value of attribute j is positive,
0 if the value of attribute j is neutral,
-1 if the value of attribute j is negative,
C = the number of attributes whose value is not 0,
am[j] = the maximum value of attribute j.
To avoid the effect of disproportionately large attribute
values, a[j]/am[j] is set to 1 if a[j] is one of the largest 2%
of attribute values. Method 2 has the advantage of accommo-
dating large attribute values by normalizing with respect to
their maximum.
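As a sketch in Python (the function and variable names are ours, the example attribute values are invented, and we read C as the count of non-neutral attributes, which the terse definition above leaves ambiguous), the Method 2 weight might be computed as:

```python
def method2_weight(e, a, s, am, cap_set=frozenset()):
    """Illustrative sketch of Method 2's normalized weight.
    e[i]:  1 if query keyword i occurs in the record, else 0.
    a[j]:  value of attribute j; am[j]: maximum value of attribute j.
    s[j]:  +1 / 0 / -1 for positive / neutral / negative attribute settings.
    cap_set: indices j whose a[j] falls in the largest 2% of attribute
             values; their ratio a[j]/am[j] is clamped to 1."""
    C = sum(1 for sj in s if sj != 0)   # non-neutral attributes (our reading)
    if C == 0:
        return 0.0                      # all-neutral case: weight collapses to 0
    inner = 0.0
    for j in range(len(a)):
        ratio = 1.0 if j in cap_set else a[j] / am[j]
        inner += s[j] * ratio
    # The inner sum is independent of i, so the outer sum over keywords
    # reduces to multiplying by the number of matching keywords.
    return sum(e) * inner / C

# Invented example: only REC_SIZE positive (s[5] = 1), six attributes,
# two of three query keywords matching.
e = [1, 1, 0]
s = [0, 0, 0, 0, 1, 0]
a = [3, 5, 1, 7, 800, 2]
am = [10, 20, 4, 9, 1000, 6]
print(method2_weight(e, a, s, am))      # 2 * (800/1000) / 1 = 1.6
```

Note how the all-neutral case returns 0 regardless of keyword matches; this is exactly the limitation discussed next.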
However, this method does not automatically account for
coverage, since s[j] contributes as a multiplicand rather
than as an exponent as in Method 1. Therefore, if the
attributes are all neutral (level 1 in Table 1, the default
attributes), the weight would be 0 even though query
terms occur in the record r. We are therefore still
looking for ways to improve the method.
3.0 Experiments
The TREC92 collection of text and topics was used to
quantify and analyze the performance of FAIRS in the
domain of very large collections. The entire collection
includes 2.3Gb of text from various sources and of various
formats.
3.1 System Configuration
The hardware platform used for indexing and query pro-
cessing was an IBM RS/6000 320 workstation with AIX
3.0 operating system. The core memory (RAM) installed
was 32 Mb. The net disk space available was
4,169,728Kb. The CPU clock rate was 25 MHz, with a
MIPS rating of 27. The disk access time was 9.8 ms on
average.
3.2 Indexing Performance
Our experiments showed that indexing performance is
strictly constrained by I/O wait. We used several tech-
niques to reduce this constraint and optimize throughput.
We scheduled index runs to run simultaneously, thus keep-
ing the CPU busy during I/O wait. We used a kernel which
employs automatic disk caching (providing an order-of-
magnitude improvement in indexing time). To further
exploit caching, large amounts of high-speed core memory
should be used (our limit of 32 Mb is by no means ideal).
The I/O bottleneck also implies that the very fastest disks
should be used. We chose 9.8 ms disks as the fastest available
for the platform at a reasonable price. Using the above con-
figuration, the sustained indexing throughput was on the
order of 10 days (240 hrs.) per Gb. The storage overhead for
index output was 110%. That is, for 1 Gb of text, 2.1 Gb of
space should be available before indexing is initiated. By
this measure, the entire TREC92 collection of 2.3 Gb
would require 4.83 Gb of storage. Since only 4.17 Gb was
available, we participated in category B (Wall Street Jour-
nal only).
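The storage arithmetic above can be checked in a couple of lines (a sketch; the 110% overhead figure is the one reported in the text):

```python
def required_space_gb(text_gb, overhead=1.10):
    # Raw text plus its index (110% of the text size) must both
    # fit on disk before indexing starts.
    return text_gb * (1 + overhead)

print(round(required_space_gb(1.0), 2))   # 2.1 Gb for 1 Gb of text
print(round(required_space_gb(2.3), 2))   # 4.83 Gb for the full TREC92 collection
```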
3.3 Query Processing
Before the raw topics were submitted to FAIRS, they were con-
verted automatically into FAIRS-compatible queries of the
form described in section 2.1.4. The TREC92 topics were
pre-processed by a simple syntactic term-token generator.
The pre-processor removed stop words, stemmed topic terms,
and removed case. It used an ad hoc term-
weighting scheme to assign IMPORTANCE weights to
terms according to their positions in the topic (e.g. terms
occurring in the title section were assigned an arbitrarily
higher importance than those in other sections).
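A pre-processor of this kind might be sketched as follows (a toy version: the stop-word list, the suffix-stripping stemmer, the section weights, and the sample topic text are all our own illustrative assumptions, not the original scheme):

```python
# Toy topic pre-processor: stop-word removal, crude stemming, case folding,
# and position-based IMPORTANCE weighting (all values are assumptions).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to"}

def crude_stem(term):
    # A toy suffix stripper; the paper does not specify its stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess_topic(sections):
    """sections: mapping of section name -> raw topic text.
    Returns (term, importance) pairs; title terms get a higher,
    arbitrarily chosen IMPORTANCE weight than terms elsewhere."""
    weights = {"title": 3, "description": 2}   # assumed values
    query = []
    for name, text in sections.items():
        w = weights.get(name, 1)
        for token in text.lower().split():
            token = token.strip(".,;:!?")
            if token and token not in STOP_WORDS:
                query.append((crude_stem(token), w))
    return query

topic = {"title": "Pending Antitrust Cases",
         "description": "Documents discussing pending antitrust cases in the courts"}
print(preprocess_topic(topic))
```

The output pairs map directly onto the query form of section 2.1.4, with the second element serving as the term's IMPORTANCE weight.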