SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Multilevel Ranking in Large Text Collections Using FAIRS
chapter
S-C. Chang
H. Dediu
H. Azzam
M-W. Du
National Institute of Standards and Technology
Donna K. Harman
should be customized. After examining a preview or the
full text of the highly ranked records, the user can then
revise the query or even change the ranking strategy.
2.1.6 Synonyms
A synonym definition capability (glossary) is also avail-
able within FAIRS. Given a word, FAIRS will retrieve
instances of that word, as well as instances of its syn-
onyms. The option can be invoked to broaden a query.
Users can build their own vocabulary or invoke and mod-
ify the system-wide synonym dictionary. This can be used
to ease the problem of different word usages from differ-
ent people.
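The synonym option can be pictured as a simple query expansion
step. The following is a minimal sketch, not the FAIRS
implementation: the function name expand_query and the glossary
contents are hypothetical, and the glossary is assumed to be a
plain mapping from a word to its list of synonyms.

```python
def expand_query(words, glossary):
    """Return the query words plus any synonyms listed in the glossary."""
    expanded = []
    for word in words:
        # Keep the original word first, then its synonyms, skipping duplicates.
        for candidate in [word, *glossary.get(word, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical user glossary; the entries are not taken from FAIRS.
glossary = {"car": ["automobile", "vehicle"]}
print(expand_query(["fast", "car"], glossary))
# prints ['fast', 'car', 'automobile', 'vehicle']
```

Words with no glossary entry pass through unchanged, so the
expansion only ever broadens the query.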
A very fast elastic string matching algorithm² [5] is being
evaluated for inclusion among the query expansion fea-
tures of FAIRS.
2.1.7 Displaying Records
When a request is made, users will always be presented
with the retrieved records in their full-text form. Further-
more, FAIRS provides several ways to associate non-text
information with each record. There are basically two
types of links: implicit and explicit. Implicit links use
source information as-is while explicit links involve spe-
cial fields embedded in the source text. Implicit links are
useful in situations where an implied one-to-one mapping
may be established between records and image files.
Explicit links may be used to express one-to-many rela-
tions between records and other media.
2.1.8 Ranking
One of the most interesting aspects of FAIRS is its
unconventional ranking scheme to determine the relative
relevance of retrieved records. The ranking scheme is
designed to mimic the human relevancy judgement process.
When a person is asked to determine the relative relevance
of two records, he or she is likely to first weigh them
using a set of criteria. If the two records have the same
weight under that set of criteria, a secondary set of
criteria may be used to differentiate them, and so on. The
criteria used can be highly heuristic. A patent application
for the adaptation of this "multilevel ranking scheme" has
been filed with the Patent Office in the United States.
2. Patent Pending
To enable FAIRS to use free association in place of Boolean
semantics, a multilevel ranking model [1,2] for full-text
information retrieval has been developed and implemented.
FAIRS ranks records with respect to a particular query
according to a set of rules. The default rules consist of
six attributes in six levels. The six attributes are the
importance, popularity, frequency, and location of a search
word, and the size and record ID of the record it occurs in.
Each attribute may have a positive, negative, or no
(neutral) impact on the relevance judgement of a record.
Such an arrangement also guarantees the automatic
consideration of coverage (i.e., the percentage of different
query words covered by a record), which is next to
impossible to implement in a Boolean environment [1,2]. The
ranking rules of FAIRS are always accessible and modifiable
by the user.
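The level-by-level comparison described above behaves like a
lexicographic sort. The following is a minimal sketch under
stated assumptions, not the FAIRS implementation: each record is
assumed to carry one precomputed number per attribute, the
function name multilevel_rank and the sample values are
hypothetical, and the rule order shown is only illustrative. An
impact of +1 means larger is better, -1 means smaller is better,
and 0 drops the attribute.

```python
def multilevel_rank(records, rules):
    """Sort records by (attribute, impact) rules, highest level first."""
    # Neutral attributes (impact 0) take no part in the comparison.
    active = [(attr, impact) for attr, impact in rules if impact != 0]

    def key(record):
        # Tuples compare element by element, so a lower level only
        # matters when every higher level is tied.
        return tuple(impact * record[attr] for attr, impact in active)

    return sorted(records, key=key, reverse=True)


# Hypothetical records with precomputed attribute values.
records = [
    {"id": "A", "importance": 4, "popularity": 15, "frequency": 2},
    {"id": "B", "importance": 4, "popularity": 3, "frequency": 1},
    {"id": "C", "importance": 2, "popularity": 1, "frequency": 9},
]
rules = [("importance", +1), ("popularity", -1), ("frequency", +1)]
print([r["id"] for r in multilevel_rank(records, rules)])
# prints ['B', 'A', 'C']
```

Records A and B tie on importance, so the second level
(popularity, with negative impact) decides between them. Record
C has the highest frequency but never reaches that level because
it loses at the first one.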
Descriptions of the attributes chosen for FAIRS follow:
FREQUENCY: The number of occurrences of the key-
word in the record. This attribute may be used to reflect
interest in finding records with more repetitions of a given
term. That is, when set to have a positive impact, the more
instances of a term in a record, the more relevant the
record. Therefore, the record with the higher frequency of
a term is more likely to be retrieved.
IMPORTANCE: FAIRS provides the searcher the ability to
assign an arbitrary weight (importance) to each query key-
word; thus the user has additional control over how
records are retrieved. For example, specifying breakfast:4
assigns a weight of 4 to the term breakfast in a query. In
general, keyword weighting allows the searcher to change
how FAIRS sorts records in order to identify the most rele-
vant ones. If a keyword is not weighted, FAIRS supplies a
default weight which is in reverse proportion to its input
position in the query (words that came to mind first
deserve more weight). Therefore, if the searcher chooses
to weight some or all keywords in a query with a range of
weights, records strong in the heavily-weighted keywords
are ranked before others. This attribute is defaulted to have
a positive impact on the relevance judgement.
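Keyword weighting can be sketched as a small parsing step. This
is a hedged illustration only: the word:weight syntax follows
the breakfast:4 example above, but the text does not give the
exact default formula, so the 1/position default used here is an
assumed reading of "reverse proportion", and the function name
parse_query is hypothetical.

```python
def parse_query(query):
    """Split a query into (keyword, weight) pairs.

    'word:4' gives an explicit weight of 4. Unweighted words get
    1/position (position counted from 1), which is an assumption.
    """
    weighted = []
    for position, token in enumerate(query.split(), start=1):
        word, sep, weight = token.partition(":")
        if sep:
            # Explicit weight supplied by the searcher.
            weighted.append((word, float(weight)))
        else:
            # Assumed default: earlier words receive larger weights.
            weighted.append((word, 1.0 / position))
    return weighted


print(parse_query("breakfast:4 cereal milk"))
```

Under this assumption the explicitly weighted term breakfast
outranks the unweighted terms, and cereal (second position)
receives a larger default weight than milk (third position).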
POPULARITY: This is the number of times a term occurs
in the entire collection, as opposed to the number of times
it appears in the retrieved record. For example, if the word
software appeared, at least once, in 15 records in a collec-
tion, its popularity is considered to be 15. This attribute is
usually used in the negative sense and, by default, FAIRS
assumes that the more popular a term is, the less effective
it is in retrieval.
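The difference between FREQUENCY and POPULARITY can be sketched
as follows. This is an illustration under assumptions, not FAIRS
code: records are assumed to be plain strings split on
whitespace, popularity follows the software example above (the
number of records containing the term at least once), and the
function names and sample records are hypothetical.

```python
def frequency(term, record):
    """Count the occurrences of the term within one record."""
    return record.lower().split().count(term.lower())


def popularity(term, records):
    """Count the records in which the term appears at least once."""
    return sum(1 for record in records if frequency(term, record) > 0)


# Hypothetical three-record collection.
records = [
    "software for text retrieval",
    "hardware and software software",
    "ranking records",
]
print(frequency("software", records[1]))  # prints 2
print(popularity("software", records))    # prints 2
```

With a negative impact on popularity, a term found in many
records contributes less to the ranking than a rare one.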
REC_ID: The record ID is the location of a record in the
collection. It may indicate the age of the record. This is
also useful when the records are arranged according to
their degree of significance (in either increasing or
decreasing order.) This attribute is a good example for