NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Multilevel Ranking in Large Text Collections Using FAIRS

S-C. Chang, H. Dediu, H. Azzam, M-W. Du

National Institute of Standards and Technology
Donna K. Harman

2.1.1 Using Files As-Is

Raw textual information is broken into records (i.e., the units that will be subsequently retrieved). Consistent with the goal of using the information as-is (i.e., eliminating file conversion), FAIRS does not require the input files to be in any particular format [1]. Any ASCII file can be indexed and searched. However, if a file does have a logical field structure that the user wants to use in searching (e.g., restrict a search to text in the NAME field of records), then this structure must be described to FAIRS. If the information files do not have "record" or "field" structures, then an implicit method must be used to partition these files into records.

To use the files as-is, FAIRS allows the user to impose a record structure. Any character string(s) can be designated as end-of-record marker(s). Such a page/record structure has been used successfully, and a number of documents have been quickly transformed into textbases that could be processed by FAIRS. Alternatively, a fixed line count, such as 66 lines or 78 lines, can be used as an end-of-record marker. Furthermore, each file may be declared as an individual record if multiple files are present. Records can be further divided into fields by using any character string(s) as field markers.

For retrieval, FAIRS requires the information base being searched to be partitioned into indexed, retrievable records. Either the record/field structure in the original text file can be used or, in the absence of such structure, it can be imposed. In either event, a description of the record/field format will control how FAIRS breaks the raw text file into records.
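As an illustration only, the record and field partitioning described above might be sketched in Python as follows. The function names (split_records, split_fields), their parameters, and the choice of regular expressions are assumptions made for this sketch; they are not taken from FAIRS, whose actual format-description syntax is not given here.

```python
import re
from typing import List, Optional, Sequence


def _split_on(text: str, markers: Sequence[str]) -> List[str]:
    """Split text on any of the literal marker strings (hypothetical helper)."""
    if not markers:
        return [text]
    # Longest markers first, so a marker that is a prefix of another
    # does not shadow the longer one.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in ordered)
    return re.split(pattern, text)


def split_records(text: str,
                  markers: Optional[Sequence[str]] = None,
                  line_count: Optional[int] = None) -> List[str]:
    """Partition raw text into records without converting the file.

    Assumed behaviour for this sketch: use end-of-record marker strings,
    or a fixed number of lines per record, or (with neither) treat the
    whole text as a single record.
    """
    if markers and line_count:
        raise ValueError("give markers or line_count, not both")
    if markers:
        parts = _split_on(text, markers)
    elif line_count:
        lines = text.splitlines()
        parts = ["\n".join(lines[i:i + line_count])
                 for i in range(0, len(lines), line_count)]
    else:
        parts = [text]
    # Drop records that contain only whitespace.
    return [p for p in parts if p.strip()]


def split_fields(record: str, field_markers: Sequence[str]) -> List[str]:
    """Divide one record into fields using literal field-marker strings."""
    return [f.strip() for f in _split_on(record, field_markers) if f.strip()]
```

For example, split_records(text, markers=["\f"]) would treat each form-feed-delimited page as a record, while split_records(text, line_count=66) would impose a record every 66 lines.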
Note that a file format description is separate from the described input file, which can be used as-is (without conversion).

These methods of record definition are the result of practical experience with text collections generated in typical business environments. A flexible input text structure or pre-processor is essential to the effectiveness of a general-purpose information retrieval system.

2.1.3 Stemming

Words are stemmed at index time and at query parsing time. The stemming rules used are those published by Paice [4]. The rules are incomplete but are claimed to be satisfactory. Words shorter than 4 characters are not stemmed. Conflation is used to avoid the inadequacies of truncation. Paice's scheme does not return "correct" roots, but does ensure that members of a family reduce to the same root, and members of different families reduce to different roots.

2.1.4 Query Formation

FAIRS handles free association queries for information [1,2]. Free association is the essence of being "user friendly" in information retrieval, rather than graphical user interface features such as windows, icons, mice, and pulldown menus. Free association means that there is no set of keywords that must be known and used exactly. Queries do not have to be phrased as Boolean expressions with ANDs and ORs. Words, phrases, and even word stems thought to be relevant are listed in a free association fashion. A FAIRS query has the simple form:

term_1[w_1] [,] term_2[w_2] [,] ... term_n[w_n]

where term_i can be a single word or a phrase and w_i is a user-specified weight for that term. There will be a hit for each record in the information base that contains at least one word from the query. Commas are the only delimiters used to delineate phrases, so that the relative adjacency of search words (i.e., the distance between two words that appear in a query term such as "scenic view" or "software engineering") can be considered automatically.
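A minimal sketch of how such a query might be parsed and matched is given below. It rests on assumptions not confirmed by the text: that a weight is written as a trailing number after its term, that the square brackets in the query form mark optional parts, and that matching is done on lowercased words without stemming. The names parse_query and is_hit are hypothetical. The default weight of 1 follows the paper.

```python
import re
from typing import List, Tuple

Term = Tuple[str, float]


def parse_query(query: str) -> List[Term]:
    """Parse 'term [weight], term [weight], ...' into (term, weight) pairs.

    Assumption for this sketch: a weight is a trailing number after the
    term; a term with no weight gets the default weight 1.
    """
    terms: List[Term] = []
    for chunk in query.split(","):        # commas delimit words/phrases
        tokens = chunk.split()
        if not tokens:
            continue                      # skip empty chunks such as ",,"
        weight = 1.0
        if len(tokens) > 1 and re.fullmatch(r"\d+(\.\d+)?", tokens[-1]):
            weight = float(tokens.pop())  # trailing number is the weight
        terms.append((" ".join(tokens).lower(), weight))
    return terms


def _words(text: str) -> set:
    """Lowercased alphanumeric words of a text (no stemming in this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_hit(record: str, terms: List[Term]) -> bool:
    """A record is a hit if it contains at least one word from the query."""
    query_words = set()
    for term, _weight in terms:
        query_words |= _words(term)
    return not _words(record).isdisjoint(query_words)
```

With these assumptions, parse_query("scenic view 3, software engineering") yields the phrase "scenic view" with weight 3 and "software engineering" with the default weight 1. Keeping each phrase intact leaves room for a later ranking step to use word adjacency.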
The weight of a term is used at ranking time; if omitted, it defaults to 1.

2.1.2 Stop Lists

FAIRS constructs an index to support the full text retrieval of records. A user-defined stop list can be used to limit the words indexed. For the TREC92 experiments the stop word list consisted of 280 words. It was customized with the addition of several common embedded SGML strings and spurious field definitions such as: docno, doc, text, journal, title, summary, descriptor, author, fileid, file_id, &bullet, &para, &degrees, etc.

2.1.5 Previewing Records

The ranking results are shown to the user via several interactive, selectable preview features, which show some aspects of the top-ranked records. The aspects include the first (or all) lines containing query words, coverage and frequency data, or simply the first few lines of highly ranked records. This is designed to give users a better feel for how relevant the retrieved records really are, and thus a feel for how well the ranking rules work, and how they