SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Knowledge-Based Searching with TOPIC chapter J. Lehman C. Reid National Institute of Standards and Technology D. K. Harman

2.3 USER INTERFACE TO SEARCH

Every search is automatically configured into a rule. The simplest search is a list of terms, which may be entered at the keyboard, selected from the content of displayed documents, or selected from lists of terms. This list is automatically enhanced by term expansion, by expansion to existing named rules whenever a rule name appears in the search expression, and by evidence aggregation. Searches involving structured fields are generally addressed by a form interface, which aggregates field and full-text content. Any list of terms, rule names, or extensions such as thesaurus/soundex may be used to initiate a search or to add to a search expression.

2.4 SEARCH RESULT ANALYSIS AIDS

The Topic philosophy is to minimize the elapsed time needed to obtain the relevant details that constitute an answer or support a decision. This necessitates analysis aids beyond search composition and result list display. The Topic result list may be browsed (by page, result number, etc.). Selecting a document for display produces the full text with all search evidence highlighted (e.g. in reverse video or color). The display may be the native form of the document, which for most of today's collections means a marked-up format with useful user guidance in the markup itself (e.g. sections, paragraph headings, etc.). The user may choose to browse or to move directly to the first/next/previous occurrence of a search term in the document. Similarly, the user may move through the document using document enhancements such as hypertext links, and may follow hypertext links to other documents, including graphics and other media. Previously generated annotations are available for browsing. Queries or other applications may be linked to document content.
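The first/next/previous navigation and the highlighting of search evidence described above can be illustrated with a small sketch. This is not Topic's implementation: the function names (find_hits, next_hit, previous_hit, highlight) are hypothetical, matching is assumed to be whole-word and case-insensitive, and square brackets stand in for reverse video or color.

```python
import re
from bisect import bisect_left, bisect_right


def find_hits(text, terms):
    """Return sorted (start, end) spans of whole-word, case-insensitive matches."""
    # Longer terms first, so a multi-word term wins over a shorter one it contains.
    ordered = sorted((t for t in terms if t), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [m.span() for m in pattern.finditer(text)]


def next_hit(hits, cursor):
    """First hit that starts strictly after the cursor offset, or None."""
    starts = [start for start, _ in hits]
    i = bisect_right(starts, cursor)
    return hits[i] if i < len(hits) else None


def previous_hit(hits, cursor):
    """Last hit that starts strictly before the cursor offset, or None."""
    starts = [start for start, _ in hits]
    i = bisect_left(starts, cursor)
    return hits[i - 1] if i > 0 else None


def highlight(text, hits, open_mark="[", close_mark="]"):
    """Wrap every hit in markers (a stand-in for reverse video or color)."""
    pieces = []
    last = 0
    for start, end in hits:
        pieces.append(text[last:start])
        pieces.append(open_mark + text[start:end] + close_mark)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)
```

The terms passed to find_hits need not be those of the original search, so the same sketch also covers the use of an arbitrary term as a browsing aid.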
A specific search term (not necessarily part of the original search) may be used as a browsing aid to the document.

2.5 SECURITY

Users may be prevented from accessing information via operating system permissions and built-in access controls, including discretionary ones. The product processes have been certified at system high in many installations, and some sponsors have applied for MLS certifications based upon the delivered product.

2.6 DATA ARCHITECTURE/PERFORMANCE/CONFIGURATION

Topic enables the logical division of a collection of documents into "partitions", which hold document descriptions and indexing data about an arbitrary/intentional subset of the collection. Partition size, purpose and characteristics are under the application administrator's control. The raw documents are not "owned" by the Topic application. Topic will produce indices which are approximately 70% of the size of the native text (the TREC-2 index size was approximately 50%). This includes fielded, word, and subject (rule evidence) level indices. The partition data is platform-independent (i.e. the documents and their associated partitions may be moved to or accessed from any Topic platform). Searches may be performed on the served desktop, on a host, or both. Normal performance on a personal computer is in the thousands of document-rule nodes per second, rising to many tens of thousands of nodes per second on current workstations. The low-level evidence for search rules is contained in a size/speed-optimized index, which is essential to rapid response on complex rules. This index is automatically modified each time topic evidence is added, so the word positional information is searched only on the first use of a term. The topics index normalizes document size so that all search response times are predictable. Partitions enable incremental (ranked) results, guaranteeing a time-to-first-result of a few seconds regardless of the size of the collection.
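The idea of incremental ranked results over partitions can be sketched as follows. This is only an illustration under stated assumptions: the scoring function is a hypothetical stand-in (the fraction of words in a document that match a search term), not Topic's rule evaluation, and the names score, search_partition and incremental_search are invented for the example. The point shown is that a ranked list is available as soon as the first partition has been searched, and is refined as further partitions are processed.

```python
import heapq


def score(doc_text, terms):
    """Hypothetical relevance score: fraction of words that match a term."""
    words = doc_text.lower().split()
    if not words:
        return 0.0
    wanted = {t.lower() for t in terms}
    return sum(w in wanted for w in words) / len(words)


def rank_key(pair):
    """Order (score, doc_id) pairs best first, ties broken by document id."""
    return (-pair[0], pair[1])


def search_partition(partition, terms):
    """Rank one partition (a dict of doc_id -> text), dropping zero scores."""
    scored = [(score(text, terms), doc_id) for doc_id, text in partition.items()]
    return sorted((p for p in scored if p[0] > 0), key=rank_key)


def incremental_search(partitions, terms, top_k=10):
    """Yield the best top_k results seen so far after each partition."""
    best = []
    for partition in partitions:
        candidates = best + search_partition(partition, terms)
        best = heapq.nsmallest(top_k, candidates, key=rank_key)
        yield list(best)
```

A caller can display the first yielded list immediately and update the display as later lists arrive, so the time to the first result depends on the size of one partition rather than on the size of the whole collection.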
The response characteristic which Topic optimizes is the time-to-first-meaningful-result. The rule evidence index may be centralized or distributed; when distributed, it provides the ability to produce a ranked results list with a minimum of network access. Integration with third-party components is available from the end-user interface or through shared libraries. The program provides logical links between document and image, document and document, document and annotation, and document and search request. Some links (image, cross-reference) may be determined automatically at indexing time. The structured field values may be entered interactively or filled automatically from a lexical analyzer. The program provides an end-user process interface between scanning, OCR/ICR and indexing.

3. THE TREC EXPERIMENTS

3.1 DATA PREPARATION

The TREC-2 text data preparation was performed on a Sun SPARC 10 (UNIX 4.1.3). Cataloguing and indexing were performed at a rate of approximately 100 Mbytes per hour. This process included the automatic extraction of 10 fields from the ASCII content. Partitions were set at 8000 documents for all data. There were no processing errors.
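The figures quoted in this chapter (roughly 100 Mbytes indexed per hour, 8000 documents per partition, and an index of about 50% to 70% of the native text size) lend themselves to simple planning arithmetic. The sketch below is only that arithmetic; the function names are hypothetical, and the collection used in the usage note is an invented example, not the TREC-2 data.

```python
import math

DOCS_PER_PARTITION = 8000     # partition size reported for the TREC-2 runs
INDEXING_MB_PER_HOUR = 100.0  # approximate cataloguing and indexing rate


def partition_count(num_documents):
    """Number of partitions needed; the last one may be only partly full."""
    return math.ceil(num_documents / DOCS_PER_PARTITION)


def indexing_hours(collection_mb):
    """Estimated cataloguing and indexing time at the reported rate."""
    return collection_mb / INDEXING_MB_PER_HOUR


def index_size_mb(collection_mb, ratio):
    """Estimated index size; ratio is about 0.5 (TREC-2) to 0.7 (typical)."""
    return collection_mb * ratio
```

For a hypothetical collection of 20,000 documents totalling 250 Mbytes, this gives 3 partitions, 2.5 hours of indexing, and an index of 125 Mbytes at the 50% ratio.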