NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman, National Institute of Standards and Technology

Multilevel Ranking in Large Text Collections Using FAIRS

S-C. Chang, H. Dediu, H. Azzam, M-W. Du
GTE Laboratories

Abstract

A description of a general-purpose multilevel ranking information retrieval prototype is presented. The methods used in weighting and ranking the retrieved documents are discussed. Experiments with the TREC92 collection of text and queries have been conducted without manual preprocessing. Initial results have shown the multilevel ranking scheme to be highly competitive in precision and recall relative to other ranking strategies.

1.0 Introduction

Information retrieval research at GTE Laboratories has led to the development of a prototype system called FAIRS (Friendly Adaptable Information Retrieval System). FAIRS has evolved into a functional and flexible system currently running on SunOS, HP-UX, Ultrix, AIX, VMS and PC platforms. It has been used in environments as diverse as literature searching, library operations, research and development, customer support, market analysis and management. FAIRS is being further tested as a retrieval engine for very large collections of text, such as those presented by the TREC92 collection and wide-area distributed collections.

FAIRS is designed to minimize user effort in the preparation of text and the learning of query syntax, while providing a user-modifiable multilevel ranking scheme (patent pending). Experiments with the TREC92 collections of text and queries have been conducted with no human intervention in the processing of either text or queries. The results of experiments against the collection of Wall Street Journal articles are listed in Section 3A.

2.0 System Overview

FAIRS uses pure text as its information base, while allowing flexible links into non-textual information. FAIRS extracts information out of an unstructured, amorphous collection of data in four main steps:

1. First, it logically partitions a raw text file, or a collection of such files, into retrievable record units. This simple record partitioning is necessary and sufficient for indexing to begin. The goal is to use source information as-is [1].

2. Second, FAIRS automatically constructs an index. Records can also be deleted from an existing index. Statistics on the collections can be generated at this time; these statistics are used in normalizing the weighting of retrieved documents.

3. The third step involves the user queries. Queries are accepted and interpreted in an intelligent and sensitive manner [1,2]. A flexible approach to the understanding of the query is essential to providing good responses.

4. Finally, once the query is processed, the relevant hits are retrieved quickly and displayed in an ordered list ranked according to a relevance measure. The records corresponding to the hits can then be viewed, printed or mailed in their entirety on demand.

2.1 Characteristics

In FAIRS, both the responses to information requests and the way relevance is determined can be customized by the user. FAIRS can provide a tool for decision making by presenting to the user all the relevant facts in an elegant and timely fashion. There are some interesting and novel features of FAIRS and research issues associated with each of these strategies. They will be discussed in sequence.
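
The four steps listed in the System Overview (partition into records, build an index with collection statistics, interpret the query, rank the hits) can be illustrated with a small Python sketch. This is not the FAIRS implementation. The function names (`partition_records`, `build_index`, `rank`), the blank-line record delimiter, the bag-of-words query handling, and the TF-IDF-style weighting are all assumptions introduced here for illustration; the text above does not specify FAIRS's record format or its relevance measure.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def partition_records(raw_text):
    """Step 1: split raw text into record units.

    Assumption: records are separated by blank lines.
    """
    return [r.strip() for r in re.split(r"\n\s*\n", raw_text) if r.strip()]


def build_index(records):
    """Step 2: build an inverted index and gather collection statistics."""
    postings = defaultdict(dict)  # term -> {record id: term frequency}
    for rec_id, record in enumerate(records):
        for term, tf in Counter(tokenize(record)).items():
            postings[term][rec_id] = tf
    stats = {"num_records": len(records)}
    return postings, stats


def rank(query, postings, stats):
    """Steps 3 and 4: treat the query as a bag of terms, then score hits.

    Assumption: a TF-IDF-style weight stands in for the relevance measure.
    Returns (record id, score) pairs, best first; ties go to the lower id.
    """
    scores = defaultdict(float)
    n = stats["num_records"]
    for term in set(tokenize(query)):
        hits = postings.get(term, {})
        if not hits:
            continue
        # Collection statistics normalize the weight of common terms.
        idf = math.log(1 + n / len(hits))
        for rec_id, tf in hits.items():
            scores[rec_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


raw = (
    "Stock prices rose on Wall Street.\n\n"
    "The library acquired new journals.\n\n"
    "Analysts cut their stock forecasts."
)
records = partition_records(raw)
postings, stats = build_index(records)
ranked = rank("wall street stock", postings, stats)
```

In this toy collection the first record matches all three query terms and the third matches only one, so the first record is ranked ahead of the third, and the second record (no matching terms) is not returned at all.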