NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
A Single Language Evaluation of a Multi-lingual Text Retrieval System
T. Dunning, M. Davis
National Institute of Standards and Technology. Donna K. Harman

to search for phrases, or to find instances of words appearing near each other. We plan in the near future to alter one of the position indices to store sentence position instead of word position relative to the beginning of a file. In other CRL systems, the ability to determine whether words appear in the same sentence has proved very useful.

3. Indexing Operations

Indexing for the Multi-Lingual Text-Retrieval system is a 4-pass process. The passes are tokenization, relabelling, sorting and stripping.

3.1. Tokenization

Tokenization in English consists of putting each word in the input text on a single line along with positional and document information. Kill list processing is also done at this stage, as is the creation of a word and document table. Lemmatization and suffix stripping can be done at this stage, but no savings result, since this merely increases the average length of posting vectors during retrieval. Further, lemmatizing or stripping suffixes this early makes it impossible to opt out of either operation later.

In Japanese, tokenization down to the level of strings of consecutive kanji or katakana is supported, as is tokenization down to individual kanji and strings of katakana. Since strings of hiragana are typically used for inflection, they are ignored. As a result some words are ignored, since there are words in Japanese (mostly verbs) which are written in hiragana. Using kanji strings results in indexing a large number of phrases, while indexing each kanji ideograph separately increases the number of phrase searches needed later in the retrieval process, with an accompanying drop in performance.
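The tokenization pass described above can be illustrated with a minimal Python sketch. It is not the CRL implementation: the function names, the kill list contents, the tab-separated record layout and the Unicode ranges are all assumptions made for illustration, as is the choice to let killed words still consume a position so that word proximity is preserved.

```python
import re

# Hypothetical kill list; the paper does not list its contents.
KILL_LIST = {"the", "a", "of", "and", "to", "in"}

# Unicode ranges assumed for this sketch (not taken from the paper).
KANJI = "\u4e00-\u9fff"
KATAKANA = "\u30a0-\u30ff"


def tokenize_english(doc_id, text, kill_list=KILL_LIST):
    """Return one (word, doc_id, position) record per kept word.

    Position is the word offset from the start of the document.
    Killed words are dropped but still advance the position counter
    (an assumption of this sketch).
    """
    records = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        word = match.group().lower()
        if word not in kill_list:
            records.append((word, doc_id, position))
    return records


def tokenize_japanese(doc_id, text, split_kanji=False):
    """Index runs of kanji or katakana; hiragana runs are skipped.

    With split_kanji=True each kanji is emitted as its own token,
    while katakana runs are always kept whole.
    """
    records = []
    position = 0
    pattern = re.compile(f"[{KANJI}]+|[{KATAKANA}]+")
    for match in pattern.finditer(text):
        run = match.group()
        if split_kanji and re.match(f"[{KANJI}]", run):
            pieces = list(run)
        else:
            pieces = [run]
        for piece in pieces:
            records.append((piece, doc_id, position))
            position += 1
    return records


def to_lines(records):
    """Format records one per line: word, document, position."""
    return [f"{word}\t{doc_id}\t{position}" for word, doc_id, position in records]
```

Calling `to_lines(tokenize_english(1, "The cat sat on the mat"))` yields one line per kept word. Passing `split_kanji=True` to `tokenize_japanese` switches from indexing whole kanji strings to indexing individual ideographs, which is the trade-off between index size and later phrase searches discussed above.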
Preprocessing to mark and normalize company, place and person names and dates is best included in the tokenization process. We plan to incorporate elements of CRL's Tipster Extraction software into this program at some time to investigate the impact this has on performance, but we have not done so at the present time.

Indices for fielded data can be created by using a conventionalized word location such as 0. This approach allows all such data to be handled in a uniform manner.

3.2. Relabelling

The output of the tokenization process is still ASCII or some form of extended ASCII such as Unicode or JIS. The next step is the conversion to a uniform binary form which assists