NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, Donna K. Harman

A Single Language Evaluation of a Multi-lingual Text Retrieval System

Ted Dunning
Mark Davis
Computing Research Laboratory
New Mexico State University
Las Cruces, NM 88003

1. Introduction

The goal of the participation by the Computing Research Laboratory at New Mexico State University in the Text REtrieval Conference was to evaluate our multi-lingual text retrieval system in a mono-lingual setting and in the context of a large set of documents. This system is currently being developed for use in a multi-lingual setting, but we felt that there were novel aspects to the system which should prove useful in retrieval from a large text base. In particular, the system is able to find and make use of phrases in queries, which aids retrieval. Significant phrases were marked using a likelihood ratio test, which is more reliable than previously used measures.

2. System Structure

Our Multi-Lingual Text-Retrieval system is a fairly conventional system built around an inverted index structure. The position of each word is recorded in terms of which document it appeared in and where in that document it appeared, both in bytes and in words. To maintain language independence and to improve modularity, tokenization during indexing can be done as a separate process. This modularity allows most of the system to remain unchanged when changing languages from English to Japanese, or when changing the indexing structure to support fast access to, say, a traditional dictionary of word definitions where only the various spelling forms of the head word might be indexed.
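The index organization described above can be illustrated with a minimal sketch. It is not the system's actual code: the names `Posting`, `ascii_word_tokenizer`, `utf8_char_tokenizer`, and `build_index` are hypothetical, and the in-memory dictionary stands in for whatever on-disk structure the real system uses. The sketch only shows how each posting can carry a document number, a byte offset, and a word offset, and how the tokenizer can be swapped without touching the indexing code.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, NamedTuple, Tuple


class Posting(NamedTuple):
    doc_id: int       # which document the token appeared in
    byte_offset: int  # position within the document, counted in bytes
    word_offset: int  # position within the document, counted in tokens


# A tokenizer maps raw document bytes to (token, byte_offset) pairs.
Tokenizer = Callable[[bytes], Iterator[Tuple[str, int]]]


def ascii_word_tokenizer(data: bytes) -> Iterator[Tuple[str, int]]:
    """Hypothetical English-style tokenizer: runs of ASCII letters and digits."""
    for match in re.finditer(rb"[A-Za-z0-9]+", data):
        yield match.group().decode("ascii").lower(), match.start()


def utf8_char_tokenizer(data: bytes) -> Iterator[Tuple[str, int]]:
    """Hypothetical tokenizer for unsegmented text: one token per non-space character."""
    offset = 0
    for ch in data.decode("utf-8"):
        if not ch.isspace():
            yield ch, offset
        offset += len(ch.encode("utf-8"))


def build_index(docs: Dict[int, bytes], tokenize: Tokenizer) -> Dict[str, List[Posting]]:
    """Build an inverted index; the indexing loop never inspects the language."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, data in docs.items():
        for word_offset, (token, byte_offset) in enumerate(tokenize(data)):
            index[token].append(Posting(doc_id, byte_offset, word_offset))
    return index


docs = {1: b"Text retrieval from a large text base", 2: b"A large index"}
index = build_index(docs, ascii_word_tokenizer)
print(index["text"])
print(index["large"])
```

Passing `utf8_char_tokenizer` instead of `ascii_word_tokenizer` changes how tokens are cut but leaves `build_index` untouched, which is the kind of modularity the separate tokenization step is meant to provide.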

All postings in the final database are constant-length structures in order to facilitate and simplify vector-oriented operations. Document numbers and locations are uniformly stored as 4-byte integers so that, for all practical purposes, there are no limits on the number or size of documents and tokens. Since full positional information is kept, it is relatively fast and simple