NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
National Institute of Standards and Technology; D. K. Harman

Knowledge-Based Searching with TOPIC®

John W. Lehman, Clifford A. Reid, et al.
Verity, Inc.
1550 Plymouth Street, Mountain View, CA 94043
(415) 960-7620
jlehman@verity.com

1. OBJECTIVE OF VERITY'S TREC-2 EXPERIMENTS

Verity, Inc. is the first major commercial product participant in TREC. Verity's product is TOPIC®. Verity participated in TREC-2 as a Category A site. This was Verity's first TREC, and we encountered many of the logistical problems that other sites reported from their TREC-1 experience.

Topic's search users wish to understand the search result quality they can expect in their own searches on their (large) collections. Verity also expects to obtain insights for future product improvements.

Topic is a mature commercial-off-the-shelf manual text search program that combines the results of human expertise with a powerful search expression language and fast search algorithms. Topic installations use manually or semi-automatically developed libraries of searches (topics), which are instances of the search expression language and which are supplied to all users.

Verity begins its TREC experiments by gathering "ground truth" on the quality of unaided ad hoc end-user search results. Future experiments will incorporate predefined searches (topics) and other Topic search aids to determine how much they improve search result quality.

2. TOPIC SEARCH APPROACH

The Topic philosophy: Domain knowledge, both descriptive and content-based, using constructs specifically designed to discriminate between full-text material, is the only way to consistently obtain high recall/precision on large heterogeneous collections.
Search result quality may be enhanced by employing collection-specific statistics to locate additional domain-relevant terminology. Searches are repeated, and subject-matter expertise is a scarce resource.

The problem that Topic addresses is the effective use of a human's time in analyzing search results: locating the preponderance of relevant details in the fewest possible documents, and therefore in the smallest possible elapsed time.

2.1 TOPIC KNOWLEDGE REPRESENTATION

The Topic product employs several approaches to individual term search, organized by a rule-based, or concept-based, approach to search term aggregation. In Topic, the search focus is the topic (concept, notion, idea, or subject), and the topic is the user-specified "smart" description of all of the evidence "about" or "of" the topic as it (the evidence) would be found in text documents.

2.1.1 TOPIC INDICES

The Topic product line catalogs and indexes both fielded (structured) data and full text. Topic automatically extracts structured data (such as title, author, etc.) into searchable fields, using a lexical analyzer. Fielded data is searchable separately or in combination with full text.

Indexes on the full text are (for all non-stopped characters and strings):
- word/string
- stemmed word (morphological variant)
- soundex (phonetic spelling variety)
- statistically correlated terms (called the suggestion index)
- typographical error index
- thesaurus
- wildcard (universal character/group expansion)

An index on all values (choices) for fielded data is also produced.

2.1.2 TOPIC SEARCH RULES

Search rules consist of relational comparisons to field values, exact or fuzzy matches on full-text search terms,