SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Knowledge-Based Searching with TOPIC
chapter
J. Lehman
C. Reid
National Institute of Standards and Technology
D. K. Harman
Knowledge-Based Searching with TOPIC®
John W. Lehman, Clifford A. Reid, et al.
Verity, Inc.
1550 Plymouth Street,
Mountain View, CA 94043
(415) 960-7620 | jlehman@verity.com
1. OBJECTIVE OF VERITY'S TREC-2
EXPERIMENTS
Verity, Inc. is the first major commercial product
participant in TREC. Verity's product is TOPIC®.
Verity participated in TREC-2 as a Category A Site.
This participation was Verity's first TREC, and we
encountered many of the logistical problems of other
sites in their TREC-1 experience.
Topic's search users wish to understand the search result
quality to expect in their personal searches on their
(large) collections. Verity also expects to obtain insights
for future product improvements.
Topic is a mature commercial-off-the-shelf manual text
search program combining the results of human
expertise with a powerful search expression language and
fast search algorithms. Topic's installations use
manually or semi-automatically developed libraries of
searches (topics), which are instances of the search
expression language and which are supplied to all users.
Verity begins its TREC experiments with a gathering of
"ground truth" regarding unaided ad hoc end-user search
result quality. Future experiments will incorporate
predefined searches (topics) and other Topic search aids to
determine their level of improvement/impact on search
result quality.
2. TOPIC SEARCH APPROACH
The Topic philosophy: Domain knowledge, both
descriptive and content-based, using constructs
specifically designed to discriminate between full-text
material, is the only way to consistently obtain high
recall/precision on large heterogeneous collections.
Search result quality may be enhanced by the
employment of collection-specific statistics to locate
additional domain-relevant terminology. Searches are
repeated and subject-matter expertise is a scarce resource.
The problem that Topic addresses is the effective use of a
human's time in analyzing search results to locate the
preponderance of relevant details in the fewest possible
documents, and therefore the smallest possible elapsed
time.
2.1 TOPIC KNOWLEDGE
REPRESENTATION
The Topic product employs several approaches to
individual term search, organized by a rule-based, or
concept-based, approach to search term aggregation. In
Topic, the search focus is the topic (concept, notion,
idea, or subject), and the topic is the user-specified
"smart" description of all of the evidence "about" or "of"
the topic as it (the evidence) would be found in text
documents.
2.1.1 TOPIC INDICES
The Topic product line catalogs and indexes both fielded
(structured) data, and full-text. Topic automatically
extracts structured data (such as title, author, etc.) into
searchable fields, using a lexical analyzer. Fielded data is
searchable separately or in combination with full-text.
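
The separation of fielded data from full-text can be illustrated with a minimal Python sketch. It is not Topic's lexical analyzer: the "Title:"/"Author:" header layout, the names `split_fields` and `matches`, and the sample document are all assumptions made for illustration only.

```python
import re

# Hypothetical header layout: "Title: ..." and "Author: ..." lines,
# followed by the full-text body. Not Topic's actual input format.
FIELD_LINE = re.compile(r"^(title|author):\s*(.+)$", re.IGNORECASE)

def split_fields(raw):
    """Separate leading header fields from the full-text body."""
    fields = {}
    lines = raw.splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        match = FIELD_LINE.match(line)
        if not match:
            body_start = i  # first non-field line starts the body
            break
        fields[match.group(1).lower()] = match.group(2).strip()
    body = "\n".join(lines[body_start:]).strip()
    return fields, body

def matches(raw, field_query=None, text_term=None):
    """Search fields alone, full-text alone, or both in combination."""
    fields, body = split_fields(raw)
    for name, wanted in (field_query or {}).items():
        if fields.get(name, "").lower() != wanted.lower():
            return False
    if text_term is not None:
        words = [w.strip(".,;:!?") for w in body.lower().split()]
        if text_term.lower() not in words:
            return False
    return True

doc = (
    "Title: Trade Talks Resume\n"
    "Author: J. Smith\n"
    "\n"
    "Negotiators met to discuss tariff schedules."
)
```

Calling `matches(doc, {"author": "J. Smith"}, "tariff")` combines a field comparison with a full-text term, while omitting either argument searches the other alone.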
Indexes on the full-text are (for all non-stopped characters
and strings):
- word/string
- stemmed word (morphological variant)
- soundex (phonetic spelling variety)
- statistically correlated terms (called the suggestion index)
- typographical error index
- thesaurus
- wildcard (universal character/group expansion)
An index on all values (choices) for fielded data is also
produced.
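
To make a few of the index types listed above concrete, the following Python sketch builds word, stemmed-word, and soundex indexes over a toy collection and answers a wildcard lookup against the word index. It is an illustration under stated assumptions, not Verity's implementation: `naive_stem` is a deliberately crude suffix stripper standing in for a real morphological analyzer, the soundex function follows the standard American Soundex rules, stop-word handling is omitted, and the suggestion, typographical error, and thesaurus indexes are not shown.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Standard American Soundex letter-to-digit table.
SOUNDEX_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(word):
    """Four-character phonetic key, e.g. 'Robert' and 'Rupert' share one."""
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]

def naive_stem(word):
    """Hypothetical stand-in for a stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_indexes(docs):
    """Map each word, stem, and soundex key to the set of document ids."""
    word_ix, stem_ix, sound_ix = defaultdict(set), defaultdict(set), defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if not token:
                continue
            word_ix[token].add(doc_id)
            stem_ix[naive_stem(token)].add(doc_id)
            sound_ix[soundex(token)].add(doc_id)
    return word_ix, stem_ix, sound_ix

def wildcard_lookup(word_ix, pattern):
    """Expand a shell-style pattern such as 'archiv*' over the word index."""
    hits = set()
    for word, ids in word_ix.items():
        if fnmatchcase(word, pattern):
            hits |= ids
    return hits

docs = {1: "Robert searched the archive.", 2: "Rupert is searching archives."}
word_ix, stem_ix, sound_ix = build_indexes(docs)
```

In this toy collection the exact word "searched" occurs only in document 1, while the stem "search", the soundex key of "Robert", and the wildcard pattern "archiv*" each reach both documents, which is the recall gain the additional indexes are meant to provide.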
2.1.2 TOPIC SEARCH RULES
Search rules consist of relational comparisons to field
values, exact or fuzzy matches on full-text search terms,