NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
UCLA-Okapi at TREC-2: Query Expansion Experiments
E. Efthimiadis
P. Biron
National Institute of Standards and Technology
D. K. Harman
Table 3.1  Methodology for the Routing Runs on Topics 51-100

Weighting                 Query Expansion  No. of Terms  No. of Docs   UCLA
Function   Phrases   QE   Algorithm        Expanded      Auto Rel Fbk  GSL
---------------------------------------------------------------------------
bm15       no        no   wpq               0             0            no
           yes       yes  emim             10             5            yes
           both           porter           20            10
                          r_lohi           30            15
                          r_hilo                         20
from the Topics, which were the source of the search
terms. NO means that only single terms are extracted
from the Concepts and Title fields. YES means that
phrases are extracted by a simple routine that identifies
a phrase using the punctuation found in the Concepts
and Title fields. BOTH combines the two methods: terms
are searched both as single terms and as phrases.
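The simple phrase routine described above can be sketched as follows; the exact punctuation set and handling are assumptions, since the paper does not give the routine itself:

```python
import re

def extract_phrases(field_text):
    # Split on the punctuation found in a Concepts/Title field;
    # any multi-word chunk is kept as a phrase.
    chunks = re.split(r"[,;:.()]", field_text)
    return [" ".join(c.split()) for c in chunks if len(c.split()) > 1]
```

For example, a Concepts field such as "oil spills, cleanup costs" would yield the phrases "oil spills" and "cleanup costs".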
Query Expansion (QE): The query expansion algorithm
is one of wpq, emim, porter, r_lohi, r_hilo.
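The wpq measure named here is presumably Robertson's term-selection value; a minimal sketch under that assumption (the paper does not reproduce the formula):

```python
import math

def wpq(r, n, R, N):
    # Assumed form of Robertson's wpq term-selection value.
    # r: relevant docs containing the term   R: relevant docs
    # n: collection docs containing the term N: collection size
    w1 = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                  ((n - r + 0.5) * (R - r + 0.5)))
    return (r / R - (n - r) / (N - R)) * w1
```

A term that occurs in more of the relevant documents receives a higher selection value and so ranks higher among the expansion candidates.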
Terms expanded: This specifies the number of terms to
include in the expansion. When the number of terms
expanded is zero, then only the initial query is run.
Feedback documents: This defines the number of top
ranked documents to be treated as relevant and to
provide the source for the terms for query expansion.
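Taken together, these parameters describe a blind relevance-feedback loop, which might be sketched as follows; the function and parameter names are illustrative, and rank_term stands for a scoring function (e.g. a closure over collection statistics implementing wpq or emim):

```python
def expand_query(query_terms, ranked_docs, rank_term, n_docs=5, n_terms=10):
    # Treat the top n_docs as relevant, pool their terms,
    # rank the candidates, and add the top n_terms to the query.
    pool = set()
    for doc in ranked_docs[:n_docs]:        # each doc: iterable of terms
        pool.update(doc)
    candidates = pool - set(query_terms)    # do not re-add query terms
    best = sorted(candidates, key=rank_term, reverse=True)[:n_terms]
    return list(query_terms) + best
```

With the number of terms expanded set to zero, the candidate list is simply discarded and the initial query runs unchanged.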
UCLA GSL: defines whether the standard Okapi GSL or
the UCLA enhanced version of the GSL will be used.
Because of the many parameters involved in each run,
the names of runs have been deliberately made explicit,
which, however, resulted in rather long names. For
example, bm15.phb.qexp:r_lohi-10-5.uclagsly means that
for this run the weighting function was BM15, phrases
were set to BOTH, query expansion took place, the r_lohi
algorithm was used to rank terms for query expansion,
10 terms were added in the expansion, 5 documents
provided the source of the terms for the expansion,
and the UCLA enhanced GSL was used.
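A hypothetical helper composing names in this convention; the exact separators and abbreviations are assumptions inferred from the example above:

```python
def run_name(weight, phrases, qe_alg=None, n_terms=0, n_docs=0, ucla_gsl=False):
    # phrases is "no"/"yes"/"both", abbreviated to phn/phy/phb.
    parts = [weight, "ph" + phrases[0]]
    if qe_alg:
        # query expansion: algorithm, terms added, feedback docs
        parts.append(f"qexp:{qe_alg}-{n_terms}-{n_docs}")
    if ucla_gsl:
        parts.append("uclagsly")
    return ".".join(parts)
```

A baseline run without expansion or the enhanced GSL would then simply be named "bm15.phn".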
3.2 Go-See-List
The Go-See-List (GSL) is a look-up table that contains
stopwords, semi-stopwords, prefixes, go-phrases and
synonym classes. The GSL is used during the indexing
of a database as well as during searching.
The stopword section contains terms that are thought
to have little or no value for retrieval. These include
contractions, prepositions, adverbs, etc.
The semi-stopwords are terms that are thought to have
low value for retrieval purposes. Therefore, a semi-
stopword will be searched only during the initial search if
it has been part of the user's search statement. If, however,
the term has emerged as the result of query expansion, it
is stopped, i.e., excluded from the pool of candidate terms
for query expansion.
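The semi-stopword rule can be sketched as follows; the example entries are invented for illustration:

```python
SEMI_STOPWORDS = {"use", "new", "report"}   # illustrative entries only

def keep_term(term, from_user_query):
    # A semi-stopword survives only when the user typed it;
    # semi-stopwords produced by query expansion are dropped.
    if term in SEMI_STOPWORDS:
        return from_user_query
    return True
```

Ordinary terms pass through regardless of their origin; only the low-value semi-stopwords are filtered by source.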
Go-phrases are mostly noun phrases that need to be
searched as a single unit or else precision will be very
low, e.g. New York. The GSL contains a small number
of selected go-phrases.
Synonym entries contain a mix of terms/concepts that
are treated as synonyms for retrieval purposes. These may
be true synonyms, quasi-synonyms, or semantically un-
related terms which are grouped together because of some
common properties which have value for retrieval. Finally,
the synonym entries also contain term variants that are
known to "escape" the conflation algorithm. The
structure of the UCLA GSL is given in the table below.
The Go-See-List (GSL)

                  City   Added by UCLA   UCLA total
stopwords          411              72          483
semi-stopwords       -              58           58
prefixes            18               -           18
Go-phrases          43              84          127
Synonyms           359             604          963
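One possible in-memory representation of such a GSL; all entries and names below are invented for illustration:

```python
# Minimal sketch of a GSL as a dictionary of sections.
gsl = {
    "stopwords": {"the", "of", "isn't"},
    "semi_stopwords": {"use", "new"},
    "prefixes": {"anti", "non"},
    "go_phrases": {"new york"},
    # each synonym class maps every variant to one canonical form
    "synonyms": {"usa": "united states", "u.s.": "united states"},
}

def normalize(term):
    # Look a term up in the GSL during indexing or searching.
    t = term.lower()
    if t in gsl["stopwords"]:
        return None                      # dropped entirely
    return gsl["synonyms"].get(t, t)     # conflate synonym variants
```

Mapping every variant in a synonym class to one canonical form is one way to make the class behave as a single term at both indexing and search time.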
For the UCLA GSL, the Titles and Concepts of Topics
1-100 were analyzed and synonym classes were generated
from the data. The list includes 40 personal names and
250 synonym classes. In addition, a list of organizations
and a list of common business acronyms and abbreviations
were compiled.
3.3 Query term selection
Query terms were selected from the Title and Concepts
fields of the records. The processing of these fields was
very simple. Programs written in awk and pen were used
to isolate the required fields, which were then parsed and
the resulting terms stemmed in accordance with the in-
dexing procedures followed for building the WSJ database.
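The awk/perl pipeline described above might look like the following in Python; the field markers and their handling are simplified assumptions, and stem stands for the same stemmer used to index the WSJ database:

```python
import re

def select_terms(record_text, stem):
    # Isolate the Title and Concepts fields of a topic record
    # and return the stemmed terms they contain.
    terms = []
    for tag in ("title", "con"):
        # take everything from the tag up to the next SGML-style tag
        m = re.search(rf"<{tag}>(.*?)(?=<[a-z/]|$)", record_text,
                      re.IGNORECASE | re.DOTALL)
        if m:
            words = re.findall(r"[A-Za-z]+", m.group(1))
            terms += [stem(w.lower()) for w in words]
    return terms
```

Passing an identity function as the stemmer makes the field isolation step easy to inspect on its own.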