SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Knowledge-Based Searching with TOPIC
chapter
J. Lehman
C. Reid
National Institute of Standards and Technology
D. K. Harman
2.3 USER INTERFACE TO SEARCH
Every search is automatically configured into a rule. The
simplest search is a list of terms, which may be entered
at the keyboard, selected from displayed document(s)
content, or selected from lists of terms. This list is
automatically enhanced by term expansion, expansion to
existing named rules whenever the rule name appears in
the search expression, and evidence aggregation. Searches
involving structured fields are generally addressed by a
form interface, which aggregates field and full-text
content. Any list of terms, rule-names, or extensions
such as thesaurus/soundex may be used to initiate a
search or add to a search expression.
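The expansion of a term list into named rules can be sketched
in Python as follows. This is only an illustration under stated
assumptions: the `expand_terms` function, the dictionary used as
a rule store, and the flat evidence list are hypothetical and do
not represent Topic's internal rule format or interface.

```python
def expand_terms(terms, named_rules, _active=None):
    """Expand search terms into a flat, de-duplicated evidence list.

    named_rules is a hypothetical store mapping a lower-case rule name
    to the list of terms (or further rule names) that the rule stands for.
    """
    active = _active if _active is not None else set()
    expanded = []
    for term in terms:
        key = term.lower()
        if key in named_rules and key not in active:
            # The term names an existing rule: substitute the rule's evidence.
            children = expand_terms(named_rules[key], named_rules, active | {key})
        else:
            # Plain term (or a rule that refers back to itself): keep as-is.
            children = [key]
        for child in children:
            if child not in expanded:
                expanded.append(child)
    return expanded


# Hypothetical rule store and search expression.
rules = {"aircraft": ["airplane", "helicopter", "jet"]}
evidence = expand_terms(["Aircraft", "crash"], rules)
```

Here the term "Aircraft" matches a named rule and is replaced by
that rule's members, while "crash" is kept as a plain term.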
2.4 SEARCH RESULT ANALYSIS AIDS
The Topic philosophy of minimizing the elapsed time to
obtain the necessary relevant details that constitute an
answer or support a decision necessitates analysis aids
beyond the search composition and result list display.
The Topic result list may be browsed (page, result
number etc.). A document selected for display produces
the full text display with all search evidence highlighted
(e.g. in reverse video or color). The display may be the
native form of the document, which for most of today's
collections means a marked-up format with useful user
guidance in the markup itself (e.g. sections, paragraph
headings etc.). The user may choose to browse or to
move directly to the first/next/previous occurrence of a
search term in the document. Similarly, the user may
move through the document using various document
enhancements such as hypertext links, and may follow
hypertext links to other documents, including graphics
and other media. Previously generated annotations are
available for browsing. Queries or other applications
may be linked to document content. A specific search
term (not necessarily part of the original search) may
be used as a browsing aid to the document.
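Moving to the first/next/previous occurrence of a term can be
sketched as a lookup in a sorted list of match offsets. The
function names and the whole-word, case-insensitive matching
below are assumptions for illustration; the paper does not
describe how Topic implements this navigation.

```python
import bisect
import re


def term_offsets(text, term):
    """Character offsets of every whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]


def next_occurrence(offsets, cursor):
    """Offset of the first match strictly after cursor, or None."""
    i = bisect.bisect_right(offsets, cursor)
    return offsets[i] if i < len(offsets) else None


def previous_occurrence(offsets, cursor):
    """Offset of the last match strictly before cursor, or None."""
    i = bisect.bisect_left(offsets, cursor)
    return offsets[i - 1] if i > 0 else None


offsets = term_offsets("rule x Rule y rule", "rule")
first = offsets[0] if offsets else None
```

Because the term is supplied at browse time, it need not have
been part of the original search.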
2.5 SECURITY
Users may be prevented from accessing information via
operating system permissions, and built-in access
controls, including discretionary. The product processes
have been certified at system high in many installations,
and some sponsors have applied for MLS certifications
based upon the delivered product.
2.6 DATA ARCHITECTURE/
PERFORMANCE/ CONFIGURATION
Topic enables the logical division of a collection of
documents into "partitions", which are document
descriptions and indexing data about the arbitrary/
intentional subset. Partition size, purpose and
characteristics are under the application administrator's
control. The raw documents are not "owned" by the
Topic application. Topic will produce indices which are
approximately 70% of the native text size
(the TREC-2 index size was approximately 50%). This
includes fielded, word, and subject (rule evidence) level
indices.
The partition data is platform-independent (i.e. the
documents and their associated partitions may be
moved/accessed from any Topic platform).
Searches may be performed on the served desktop, on a
host or both.
Normal performance on a personal computer is in the
thousands of document-rule nodes per second, up to
many tens of thousands of nodes per second on current
workstations. The search rule low level evidence is
contained in a size/speed-optimized index, which
is essential to rapid response on complex rules. This
index is automatically modified each time topic evidence
is added, so the word positional information is searched
only on the first use of the term. The topics index
normalizes document size so that all search response
times are predictable. Partitions enable incremental
(ranked) results, guaranteeing few-second time-to-first-
result, regardless of the size of the collection. The
response characteristic which Topic optimizes is the
time-to-first-meaningful-result. The rule evidence index
may be centralized or distributed, and when distributed, it
provides the ability to produce a ranked results list with
a minimum of network access.
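The way partitions allow incremental ranked results can be
sketched as follows: each partition is searched in turn and its
hits are merged into the running ranked list, so a first result
is available after one partition rather than after the whole
collection. The partition layout, the `score` callable and the
function name are assumptions for illustration, not Topic's
actual mechanism.

```python
import heapq


def incremental_ranked_results(partitions, score):
    """Yield (partitions_searched, ranked_so_far) after each partition.

    partitions: iterable of lists of document ids (hypothetical layout).
    score: callable mapping a document id to a relevance score; documents
           scoring 0 are treated as non-matching.
    """
    ranked = []  # (score, doc) pairs, highest score first
    for count, partition in enumerate(partitions, start=1):
        hits = [(score(doc), doc) for doc in partition]
        hits = sorted((h for h in hits if h[0] > 0), key=lambda h: -h[0])
        # Merge two lists that are each already in descending score order.
        ranked = list(heapq.merge(ranked, hits, key=lambda h: -h[0]))
        yield count, ranked


# Hypothetical scores for four documents split over two partitions.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.0, "d4": 0.5}
steps = list(incremental_ranked_results([["d1", "d2"], ["d3", "d4"]], scores.get))
```

The time to the first ranked list depends on the size of one
partition, not on the size of the collection.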
Integration with third party components is available
from the end user interface, or shared libraries. The
program provides logical links between document-image,
document-document, document-annotation, and document-
search request. Some links may be automatically
determined at indexing time (image, cross-reference).
The structured field values may be entered interactively,
or filled automatically from a lexical analyzer. The
program provides an end-user process interface between
scanning, OCR/ICR and indexing.
3. THE TREC EXPERIMENTS
3.1 DATA PREPARATION
Data preparation of the TREC-2 texts was
performed on a Sun SPARC 10 (UNIX 4.1.3).
Cataloguing and indexing were performed at the rate of
approximately 100 Mbytes per hour. This process
included the automatic extraction of 10 fields from the
ASCII content. Partitions were set at 8000 documents
for all data. There were no processing errors.
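The figures above lend themselves to a simple planning
calculation. The sketch below uses the stated partition size
(8000 documents) and indexing rate (approximately 100 Mbytes per
hour) as defaults; the function names and the example inputs are
hypothetical and are not TREC-2 measurements.

```python
import math


def partitions_needed(num_documents, docs_per_partition=8000):
    """Number of partitions when each holds at most docs_per_partition."""
    return math.ceil(num_documents / docs_per_partition)


def indexing_hours(collection_mbytes, mbytes_per_hour=100.0):
    """Approximate cataloguing and indexing time at a constant rate."""
    return collection_mbytes / mbytes_per_hour


# Hypothetical collection: 20,000 documents totalling 250 Mbytes.
example_partitions = partitions_needed(20000)
example_hours = indexing_hours(250)
```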