NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology
Donna K. Harman

Multilevel Ranking in Large Text Collections Using FAIRS
S-C. Chang, H. Dediu, H. Azzam, M-W. Du
showing that there is no perfect ranking rule that works in all
situations. This attribute may have either a positive or negative
impact, depending on user intention. Although there is no obviously
better default setting for this attribute, we chose positive as the
default, so as to give priority to later records, as they are likely
to be more timely.
REC_SIZE: The total word count of the record. This attribute may be
used to counter (normalize) the size advantage that a larger record
may have over smaller ones during rank judgement: a larger record may
have more keywords simply because it contains more words. We therefore
set its default to have a negative impact on the relevance judgement.
Of course, when records are of similar size, this attribute will have
minimal effect, and should probably be disabled to improve response
time.
WORD_LOC: The location of the first occurrence of the keyword in the
record. A negative setting (the default) of this attribute assumes
that important words appear at the beginning of a collection or
record. For example, headings or titles that contain keywords
describing the contents of a document usually appear at the beginning.
Of course, this depends entirely on how the contents of the
information are organized, which serves as another example of the
context-sensitivity of the ranking process.
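The two record-level attributes above can be illustrated with a short sketch. This is our own illustration, not FAIRS code; the function names and the 1-based position convention are assumptions.

```python
# Illustrative sketch (not FAIRS source) of deriving two ranking
# attributes from a record's text: REC_SIZE, the total word count,
# and WORD_LOC, the position of the first occurrence of a keyword.

def rec_size(record_text):
    # REC_SIZE: total number of whitespace-separated words.
    return len(record_text.split())

def word_loc(record_text, keyword):
    # WORD_LOC: 1-based position of the keyword's first occurrence,
    # or None if the keyword does not appear in the record.
    words = [w.lower() for w in record_text.split()]
    try:
        return words.index(keyword.lower()) + 1
    except ValueError:
        return None

record = "Ranking Experiments: ranking large text collections with FAIRS"
rec_size(record)             # 8 words
word_loc(record, "ranking")  # 1 -- appears in the heading, so a low location
```

A keyword found at position 1 sits in a heading or title, which is exactly the case the negative default setting of WORD_LOC is meant to reward.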
The following table shows the default ranking rules:
Table 1: Default Ranking Attribute Settings

level  Imp  Pop  Freq  Size  ID   Loc
  1     -    -    -     -    -    -
  3     -   neg   -     -    -    -
  4    pos  neg  pos   neg   -    -
  6     -    -    -     -   pos   -
The first level, having no attribute values, automatically
accounts for coverage if the weight computation is done
using Method 1, as described below in section 2.1.9.
To perform the TREC92 experiments, we changed the ranking rules as
follows: Size was introduced on level one to balance the effect of the
very large records found in the Federal Register. Consequently, we
also had to specify impacts for the Importance, Popularity and
Frequency attributes on the first level, since they are the most
important criteria for ranking.
The following table shows the TREC92 ranking rules:
Table 2: Ranking Attribute Settings for TREC92

level  Imp  Pop  Freq  Size  ID   Loc
  1    pos  neg  pos   neg   -    -
  2    pos  neg   -    neg   -    -
  3    pos  neg  pos   neg   -    -
  5     -    -    -     -   pos   -
2.1.9 Weight Computation
Two methods were available to compute the weight of a
document:
* Method 1

The weight of a retrieved record r at level l is determined by the
following formula:

    W_l[r] = Σ_{i=1..K} e[i] · Π_{j=1..A} a[j]^s[l,j]

Where:
e[i] = 1 if the ith keyword exists in the record and 0 otherwise,
a[j] = the value of attribute j,
A = the number of attributes (currently 6),
K = the total number of query keywords,
s[l,j] = 1 if the value of attribute j is positive,
         0 if the value of attribute j is neutral,
        -1 if the value of attribute j is negative.
In other words, the weight at a certain level is the sum of
the product of the attributes at that level. This weight com-
putation method automatically calculates coverage since
for each keyword e[i], the product of attributes is never 0.
The value of the attribute is configured by the user either
before running FAIRS, or before delivering the query. To
set the value before running FAIRS, a file must be created
containing the initial values. A default value is set during
system start-up.
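Method 1 can be sketched as follows. This is a minimal illustration under our reading of the formula, not the FAIRS implementation; the function name and the per-keyword attribute layout are assumptions.

```python
# Minimal sketch of Method 1 (illustrative, not FAIRS source code).
# For each query keyword present in the record (e[i] = 1), the
# contribution is the product of the attribute values raised to the
# sign exponents s[l][j] (+1 positive, 0 neutral, -1 negative); the
# level weight W_l[r] is the sum of these products over the keywords.

def method1_weight(present, attrs, signs):
    """present: e[i] flags (0 or 1) for the K query keywords.
    attrs: attrs[i][j], value of attribute j for keyword i in the record.
    signs: s[l][j] exponents in {+1, 0, -1} for the current level l."""
    total = 0.0
    for e, row in zip(present, attrs):
        if not e:
            continue                    # absent keyword contributes 0
        product = 1.0
        for a, s in zip(row, signs):
            product *= a ** s           # s = 0 neutralizes the attribute
        total += product
    return total

# Two of three keywords present; attributes here are (Frequency, Size),
# with Frequency counting positively and Size negatively (normalization):
w = method1_weight([1, 1, 0],
                   [[4.0, 100.0], [2.0, 100.0], [5.0, 100.0]],
                   [+1, -1])
# w = 4/100 + 2/100 = 0.06
```

Note how a neutral sign (exponent 0) turns an attribute into a factor of 1, which is why the product of attributes for a present keyword is never 0 and the method automatically accounts for coverage.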
* Method 2
One disadvantage of method 1 is that it lacks a common
reference scale that evenly distributes the influence of the
attributes on the weight of the document. For example,
consider a textbase with one very large record, record