NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Henderson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
2.1 Selective NLP to Nominate Information Units
In brief, the CLARIT indexing process as shown in Figure 1 involves several steps, one of
which utilizes selective natural-language processing (NLP) to identify noun phrases (NPs) in
texts, which are taken as the relevant information units in all further processing. Subsequent
steps take advantage of several statistical measures of `importance' to evaluate NPs as potential
index terms. One special feature of CLARIT processing is the use of an automatically-generated
`first-order' thesaurus for a domain to support the selection of appropriate terms. The standard
CLARIT process returns three categories of index terms: (1) terms that occur
in the document and exactly match terms in the thesaurus, (2) terms in the thesaurus
that are more general than near-matching terms in the document, and (3) terms that are `novel'
to the document and not found in the thesaurus. In addition to being categorized as exact,
general, or novel, each index term is given a numerical relevance weight deemed to reflect
its relative value in characterizing the contents of the document.
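The three-way categorization described above can be sketched roughly as follows. This is an illustrative assumption about the matching logic, not the actual CLARIT implementation: the head-word criterion for near-matches and the sample thesaurus are invented for the example.

```python
# Hypothetical sketch of CLARIT-style three-way term categorization.
# The near-match rule (shared head word) is a simplifying assumption.

def categorize_terms(doc_nps, thesaurus):
    """Split candidate noun phrases into exact, general, and novel terms."""
    exact, general, novel = [], [], []
    for np in doc_nps:
        if np in thesaurus:
            # (1) occurs in the document and exactly matches the thesaurus
            exact.append(np)
        else:
            # (2) a thesaurus term sharing the NP's head word is taken as
            # a more general near-match (illustrative criterion only)
            head = np.split()[-1]
            matches = [t for t in thesaurus if t.split()[-1] == head]
            if matches:
                general.append(min(matches, key=len))
            else:
                # (3) novel to the document, absent from the thesaurus
                novel.append(np)
    return exact, general, novel

thesaurus = {"gene expression", "protein expression", "cell line"}
nps = ["gene expression", "transient gene expression", "novel assay"]
exact, general, novel = categorize_terms(nps, thesaurus)
# exact   -> ["gene expression"]
# general -> ["gene expression"]  (generalizes "transient gene expression")
# novel   -> ["novel assay"]
```

In a full system each returned term would also carry a relevance weight; the weighting scheme is omitted here.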
2.2 `Thesaurus Discovery' to Nominate Sets of Terms for Collections
First-order thesauri are `discovered' via another CLARIT process, distinct from indexing. The
process requires a sample of documents representing a `domain'. The sample must be moder-
ately large (e.g., minimally 2 megabytes of text) and must be composed of documents that are
more or less `about' the topic of the domain.7
In general, CLARIT `thesaurus discovery' comprises algorithms and techniques for cluster-
ing phrases in collections of documents to construct first-order thesauri that optimally `cover'
an arbitrary percentage of all the terminology in the domain represented by the document col-
lection. `Normal' thesaurus discovery involves (1) decomposition of candidate NPs from the
documents to build a term lattice in which nodes are organized hierarchically from words to
phrases based on the number of phrases subsumed by the term associated with each node and
(2) selection of nodes that have high subsumption scores and that also satisfy certain structural
and statistical characteristics (such as being legitimate NPs, well distributed in the corpus, and
relatively uncommon in general English). Terms thus selected represent a subset of vocabulary
that accurately characterizes the domain. Thesaurus discovery is quite fast8 and typically yields
a subset of terminology that represents less than 5% of all the available terms in the corpus.9
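The subsumption-counting idea behind the term lattice can be sketched as follows. Decomposing phrases into contiguous subphrases and selecting by a simple count threshold are simplifying assumptions for illustration, not the actual CLARIT algorithms.

```python
# Illustrative sketch of subsumption scoring for thesaurus discovery.
# A term's score is the number of distinct candidate NPs that contain it.
from collections import Counter

def subphrases(phrase):
    """Yield every contiguous word subsequence of a phrase."""
    words = phrase.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            yield " ".join(words[i:j])

def subsumption_scores(candidate_nps):
    """Count, for each term, how many distinct candidate NPs subsume it."""
    scores = Counter()
    for np in set(candidate_nps):
        for sub in set(subphrases(np)):
            scores[sub] += 1
    return scores

nps = ["human immune response", "immune response", "immune system",
       "cellular immune response"]
scores = subsumption_scores(nps)
# "immune response" is contained in three distinct candidate phrases

# Node selection: keep multiword terms with high subsumption counts
# (stand-ins for the structural and statistical criteria in the text).
selected = [t for t, c in scores.items()
            if c >= 3 and len(t.split()) >= 2]
```

Here `selected` contains only `"immune response"`: single words like `"immune"` score higher but fail the structural (multiword) filter, mirroring the requirement that selected nodes be legitimate NPs.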
Since the TREC experiments involved a heterogeneous collection of documents and since
it was not possible to identify specific subsets of documents in the database as `about' one or
another topic, it was not possible to discover and use relevant thesauri in TREC tasks. Thus,
as shown in Figure 2, the simplified CLARIT indexing process in TREC tasks did not involve
`matching' of terms against a first-order thesaurus and did not result in three-way-categorized
index terms.
7An example of an appropriate sample might be 50 full-text articles involving "AIDS Research"; or 2,000
abstracts about "Silicon Engraving"; or even one's personal file of recent e-mail correspondence, provided it is
sufficiently large and topically coherent.
8At present, using the CLARIT research system, a thesaurus can be found for a 3-megabyte corpus in less
than 10 minutes on a DECstation 5000/200.
9In fact, the number of terms returned will vary depending on parameters the user selects when generating
the thesaurus.