NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
UCLA-Okapi at TREC-2: Query Expansion Experiments
E. Efthimiadis, P. Biron
National Institute of Standards and Technology, D. K. Harman

Table 3.1: Methodology for the Routing Runs on Topics 51-100

  Weighting   Phrases   Query       QE          No. of Terms   No. of Docs     UCLA
  Function              Expansion   Algorithm   Expanded       Auto Rel Fbk    GSL
  ---------   -------   ---------   ---------   ------------   ------------    ----
  bm15        no        no          wpq         0              0               no
              yes       yes         emim        10             5               yes
              both                  porter      20             10
                                    r_lohi      30             15
                                    r_hilo                     20

Phrases: defines whether phrases are extracted from the Topics, which were the source of the search terms. NO means that the terms extracted from the Concepts and Title fields are single terms only. YES means that phrases are extracted by a simple routine, where a phrase is identified using the punctuation found in the Concepts and Title fields. BOTH is the combination of the two methods: the terms are searched as single terms as well as phrases.

Query Expansion (QE): the choice of query expansion algorithm is one of wpq, emim, porter, r_lohi, r_hilo.

Terms expanded: specifies the number of terms to include in the expansion. When the number of terms expanded is zero, only the initial query is run.

Feedback documents: defines the number of top-ranked documents to be treated as relevant and to provide the source of the terms for query expansion.

UCLA GSL: defines whether the standard Okapi GSL or the UCLA-enhanced version of the GSL will be used.

Because of the many parameters involved in each run, the names of the runs have been deliberately made explicit, which, however, resulted in rather long names. For example, bm15.phb.qey.r_lohi-10-5.uclagsly means that for this run the weighting function used was the BM15, phrases were set to BOTH, query expansion took place, the r_lohi algorithm was used for the ranking of terms for query expansion, 10 terms were added in the expansion, 5 documents provided the source of the terms for the expansion, and the UCLA-enhanced GSL was also used.

3.2 Go-See-List

The Go-See-List (GSL) is a look-up table that contains stopwords, semi-stopwords, prefixes, go-phrases and synonym classes. The GSL is used during the indexing of a database as well as during searching.

Stopwords are terms that are thought to have little or no value for retrieval. These include contractions, prepositions, adverbs, etc.

The semi-stopwords are terms that are thought to have low value for retrieval purposes. Therefore, a semi-stopword will be searched during the initial search only if it has been part of the user's search statement. If, however, the term has emerged as the result of a query expansion, it is stopped, i.e. excluded from the pool of candidate terms for query expansion.

Go-phrases are mostly noun phrases that need to be searched as one unit or else precision will be very low, e.g. "New York". The GSL contains a small number of selected go-phrases.

Synonym entries contain a mix of terms/concepts that are treated as synonyms for retrieval purposes. These may be true synonyms, quasi-synonyms, or semantically unrelated terms which are grouped together because of some common properties which have value for retrieval. Finally, the synonym entries also contain term variants that are known to "escape" from the conflation algorithm. The structure of the UCLA GSL is given in the table below.
The Go-See-List (GSL)

                    Okapi   Added by UCLA   UCLA total
  stopwords           411              72          483
  semi-stopwords        -              58           58
  prefixes             18               -           18
  go-phrases           43              84          127
  synonyms            359             604          963

For the UCLA GSL, the Titles and Concepts of Topics 1-100 were analyzed and synonym classes were generated from the data. The list includes 40 personal names and 250 synonym classes. In addition, a list of organizations and a list of common business acronyms and abbreviations were compiled.

3.3 Query term selection

Query terms were selected from the Title and Concepts fields of the records. The processing of these fields was very simple. Programs written in awk and perl were used to isolate the required fields, which were then parsed and the resulting terms stemmed in accordance with the indexing procedures followed for building the WSJ database.
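The field-isolation and punctuation-based phrase-splitting steps described above can be sketched as follows. This is an illustrative re-creation in Python, not the original awk/perl programs: the topic tag layout, the exact punctuation set, and the function names are assumptions, and the WSJ-compatible stemming step is omitted.

```python
import re

# Hypothetical re-creation of the query-term selection pipeline.
# The real runs used awk/perl and stemmed terms with the same
# procedure used to index the WSJ database (not reproduced here).

def extract_field(topic_text, tag):
    """Isolate a field such as <title> or <concepts> from a topic record.
    The tag layout is an assumption about the topic format."""
    match = re.search(rf"<{tag}>(.*?)(?=<|$)", topic_text, re.S | re.I)
    return match.group(1).strip() if match else ""

def split_phrases(field_text):
    """Identify candidate phrases using the punctuation found in the
    field, as in the simple routine the text describes (YES / BOTH)."""
    chunks = re.split(r"[,;:.()\n]", field_text.lower())
    return [c.strip() for c in chunks if c.strip()]

def single_terms(phrases):
    """Break phrases down into single search terms (NO / BOTH)."""
    return [t for p in phrases for t in p.split()]

topic = ("<title> Airbus Subsidies "
         "<concepts> government assistance, subsidies; trade dispute")
concepts = extract_field(topic, "concepts")
print(split_phrases(concepts))
# -> ['government assistance', 'subsidies', 'trade dispute']
print(single_terms(split_phrases(concepts)))
# -> ['government', 'assistance', 'subsidies', 'trade', 'dispute']
```

Under the BOTH setting, the output of both functions would be pooled, so that "trade dispute" is searched as a unit while "trade" and "dispute" remain available as single terms.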