NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology; edited by Donna K. Harman

CLARIT TREC Design, Experiments, and Results
D. Evans, R. Lefferts, G. Grefenstette, S. Handerson, W. Hersh, A. Archbold

2.1 Selective NLP to Nominate Information Units

In brief, the CLARIT indexing process, as shown in Figure 1, involves several steps, one of which uses selective natural-language processing (NLP) to identify noun phrases (NPs) in texts; these NPs are taken as the relevant information units in all further processing. Subsequent steps apply several statistical measures of `importance' to evaluate NPs as potential index terms. One special feature of CLARIT processing is the use of an automatically generated `first-order' thesaurus for a domain to support the selection of appropriate terms. The standard CLARIT process returns three categories of index terms: (1) terms that occur in the document and exactly match terms in the thesaurus, (2) terms that are in the thesaurus and are more general than near-matching terms in the document, and (3) terms that are `novel' to the document and not found in the thesaurus. In addition to being categorized as exact, general, or novel, each index term is given a numerical relevance weight deemed to reflect its relative value in characterizing the contents of the document.

2.2 `Thesaurus Discovery' to Nominate Sets of Terms for Collections

First-order thesauri are `discovered' via another CLARIT process, distinct from indexing. The process requires a sample of documents representing a `domain'.
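The three-way categorization might be sketched as follows. This is a hypothetical reconstruction, not the actual CLARIT implementation: the suffix-based test for a `general' match and the tf-idf-style weight are assumptions standing in for the paper's unspecified near-matching and weighting procedures.

```python
# Hypothetical sketch of CLARIT-style exact/general/novel categorization.
# The category names come from the text; the matching logic and the
# relevance weight below are illustrative assumptions.
import math

def categorize_terms(doc_terms, thesaurus):
    """Assign each document NP one of three categories:
    'exact'   -- the NP itself is a thesaurus term,
    'general' -- a more general thesaurus term near-matches it (here:
                 any proper tail of the NP, e.g. 'retrieval system'
                 for 'text retrieval system'),
    'novel'   -- no thesaurus term matches."""
    results = {}
    for term in doc_terms:
        if term in thesaurus:
            results[term] = "exact"
            continue
        words = term.split()
        generalizations = [" ".join(words[i:]) for i in range(1, len(words))]
        if any(g in thesaurus for g in generalizations):
            results[term] = "general"
        else:
            results[term] = "novel"
    return results

def relevance_weight(term, doc_freq, n_docs, term_freq):
    """A simple tf-idf style score standing in for CLARIT's
    (unspecified) numerical relevance weight."""
    return term_freq * math.log((n_docs + 1) / (doc_freq + 1))
```

For example, with a thesaurus containing `retrieval system', the NP `text retrieval system' would be categorized as `general', while an NP sharing no tail with any thesaurus term would come back `novel'.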
The sample must be moderately large (e.g., minimally 2 megabytes of text) and must be composed of documents that are more or less `about' the topic of the domain.7

In general, CLARIT `thesaurus discovery' comprises algorithms and techniques for clustering phrases in collections of documents to construct first-order thesauri that optimally `cover' an arbitrary percentage of all the terminology in the domain represented by the document collection. `Normal' thesaurus discovery involves (1) decomposition of candidate NPs from the documents to build a term lattice in which nodes are organized hierarchically from words to phrases, based on the number of phrases subsumed by the term associated with each node, and (2) selection of nodes that have high subsumption scores and that also satisfy certain structural and statistical characteristics (such as being legitimate NPs, well distributed in the corpus, and relatively uncommon in general English). Terms thus selected represent a subset of vocabulary that accurately characterizes the domain. Thesaurus discovery is quite fast8 and typically yields a subset of terminology that represents less than 5% of all the available terms in the corpus.9

Since the TREC experiments involved a heterogeneous collection of documents, and since it was not possible to identify specific subsets of documents in the database as `about' one or another topic, it was not possible to discover and use relevant thesauri in the TREC tasks. Thus, as shown in Figure 2, the simplified CLARIT indexing process in the TREC tasks did not involve `matching' of terms against a first-order thesaurus and did not result in three-way-categorized index terms.

7 An example of an appropriate sample might be 50 full-text articles involving "AIDS Research"; or 2,000 abstracts about "Silicon Engraving"; or even one's personal file of recent e-mail correspondence, provided it is sufficiently large and topically coherent.
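The subsumption idea behind steps (1) and (2) can be illustrated roughly as follows. This is a sketch, not CLARIT's algorithm: the flat dictionary here stands in for the term lattice, and the selection thresholds (`min_subsumed`, `min_docs`) are invented parameters standing in for CLARIT's structural and statistical criteria.

```python
# Illustrative sketch of subsumption scoring over candidate NPs.
# A term's score is the number of distinct candidate NPs that
# contain it as a contiguous sub-phrase.
from collections import defaultdict

def subsumption_scores(candidate_nps):
    """Map every word or contiguous sub-phrase of each candidate NP
    to the number of distinct NPs that subsume it."""
    subsumers = defaultdict(set)
    for np in set(candidate_nps):
        words = np.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                subsumers[" ".join(words[i:j])].add(np)
    return {term: len(nps) for term, nps in subsumers.items()}

def select_thesaurus_terms(docs_nps, min_subsumed=2, min_docs=2):
    """Keep terms with a high subsumption score that are also well
    distributed across documents -- a stand-in for CLARIT's fuller
    structural and statistical filters."""
    all_nps = [np for doc in docs_nps for np in doc]
    scores = subsumption_scores(all_nps)
    doc_count = defaultdict(int)
    for doc in docs_nps:
        seen = {t for np in set(doc) for t in subsumption_scores([np])}
        for t in seen:
            doc_count[t] += 1
    return sorted(t for t, s in scores.items()
                  if s >= min_subsumed and doc_count[t] >= min_docs)
```

On a toy corpus whose candidate NPs include `text retrieval', `text retrieval system', and `retrieval system', the term `text retrieval' scores highly (it is subsumed by two distinct NPs) while the full phrase `text retrieval system' subsumes only itself and is filtered out.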
8 At present, using the CLARIT research system, a thesaurus can be found for a 3-megabyte corpus in less than 10 minutes on a DECstation 5000/200.

9 In fact, the number of terms returned will vary depending on parameters the user selects when generating the thesaurus.