SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Vector Expansion in a Large Collection chapter E. Voorhees Y-W. Hou National Institute of Standards and Technology Donna K. Harman

surrogate (3 senses)
{deputy, surrogate} {surrogate} {alternate, proxy, stand-in, substitute, surrogate, replacement}

opinion (5 senses)
{opinion, ruling} {opinion} {opinion, sentiment, persuasion, view} {judgment, judgement, opinion} {opinion, view}

motherhood (1 sense)
{motherhood, maternity}

decision (4 senses)
{decision} {judgment, judgement, decision, judicial_decision} {decision, deciding, decision_making} {decision, firmness}

court (4 senses)
{court, courtyard} {court, tennis_court} {court, courtroom} {court, tribunal}

Figure 1: Example Synonym Sets

customized pieces of the program without needing to modify other components. For this work, we changed only part of the indexing module of the standard SMART system. Indexing a piece of text proceeds as follows:

1. The text is broken into tokens by the standard SMART tokenizer.

2. Each token is passed in turn to a parser. The parser eliminates tokens designated as numbers, white space, or punctuation; the remaining tokens are assumed to be "words".

3. Each word is looked up in the standard SMART stop word list and is eliminated if it is found there. If the word is not a stop word, it is stemmed (using the SMART triestem stemming algorithm), assigned a concept number, and added to the list of concepts that will form the vector.

4. A word that is not a stop word is also looked up in the noun portion of WordNet before it is stemmed. If the word is in WordNet, the set of synonyms from all the synsets the word is a member of is produced. The elements of this set are also stemmed and assigned concept numbers.
Instead of the concepts being inserted into the vector list, however, they are inserted into a different relative list. Relatives that come from original text words that have only one sense in WordNet (appear in exactly one synset) are flagged as such when they are entered into the list.
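
The four indexing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SMART code: the stop list, the suffix-stripping toy_stem function (standing in for triestem), the ConceptTable class, and the NOUN_SYNSETS dictionary (a few entries copied from Figure 1, standing in for the noun portion of WordNet) are all hypothetical. The text does not say whether the original word itself is kept among its relatives; the sketch keeps it.

```python
import re

# Hypothetical stand-ins: not the SMART stop list and not WordNet itself.
STOP_WORDS = {"the", "a", "of", "in", "on", "is", "was"}

# noun -> list of synsets, using two entries from Figure 1.
NOUN_SYNSETS = {
    "motherhood": [{"motherhood", "maternity"}],
    "court": [
        {"court", "courtyard"},
        {"court", "tennis_court"},
        {"court", "courtroom"},
        {"court", "tribunal"},
    ],
}


def toy_stem(word):
    """Crude suffix stripper; a placeholder for SMART's triestem."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class ConceptTable:
    """Assigns a stable integer concept number to each stem."""

    def __init__(self):
        self._ids = {}

    def number(self, stem):
        return self._ids.setdefault(stem, len(self._ids))


def index_text(text, concepts):
    """Return (vector, relatives) for one piece of text.

    vector    -- concept numbers of the non-stop words in the text
    relatives -- (concept number, single_sense) pairs for the synonyms
    """
    # Step 1: break the text into tokens.
    tokens = re.findall(r"[A-Za-z_]+|\d+|[^\w\s]", text.lower())
    # Step 2: drop numbers and punctuation; what remains are "words".
    words = [t for t in tokens if re.fullmatch(r"[a-z_]+", t)]

    vector, relatives = [], []
    for word in words:
        # Step 3: eliminate stop words; stem and number the rest.
        if word in STOP_WORDS:
            continue
        vector.append(concepts.number(toy_stem(word)))

        # Step 4: look up the unstemmed word among the nouns and collect
        # the synonyms from every synset it belongs to.
        synsets = NOUN_SYNSETS.get(word, [])
        single_sense = len(synsets) == 1  # word appears in exactly one synset
        for synonym in sorted(set().union(*synsets)):
            relatives.append((concepts.number(toy_stem(synonym)), single_sense))
    return vector, relatives


if __name__ == "__main__":
    vec, rel = index_text("The court ruled on motherhood in 1992.", ConceptTable())
    print("vector:", vec)
    print("relatives:", rel)
```

In this sketch the relatives of "motherhood" carry the single-sense flag because the word appears in exactly one synset, while the relatives of "court" (four synsets) do not.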