NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
A sample of a document after morphological analysis is given in Figure 7. A sample of the
same document after simplex-NP extraction is given in Figure 8. Note that "owned holding
company" and "$295" or "$370" are treated as NPs along with legitimate phrases like
"computerized industrial control system". While CLARIT does have facilities to discover and
eliminate inappropriate participles (such as "owned" in isolation) and can recognize nonce
adjectives, such as "state-owned", such processing was not employed in the TREC tasks. Hence,
the correct expression, "Italian state-owned holding company", was not found or used in this
case. In addition, as noted previously, the CLARIT-TREC system did not `tokenize' company
names, dates, or other `regular-expression'-like phrases; there was no time in our schedule
for such processing.
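The participle-elimination facility described above can be illustrated with a minimal sketch. This is not the CLARIT implementation: the participle list, the NP representation (plain strings), and the filtering rule are all assumptions made for illustration.

```python
# Illustrative sketch only: drop candidate simplex NPs that consist of a
# single bare participle (e.g. "owned" in isolation), while keeping
# multi-word phrases. The participle list below is an invented example.

BARE_PARTICIPLES = {"owned", "based", "related"}  # assumed example list

def filter_simplex_nps(nps):
    """Keep each NP unless it is a lone bare participle."""
    kept = []
    for np in nps:
        words = np.split()
        if len(words) == 1 and words[0] in BARE_PARTICIPLES:
            continue  # e.g. "owned" extracted in isolation
        kept.append(np)
    return kept

print(filter_simplex_nps(
    ["owned", "computerized industrial control system"]))
```

Applied to the examples in the text, such a filter would discard "owned" while retaining "computerized industrial control system".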
All NLP (and other) processing steps were piped through the system; intermediate files
were not retained. The parsed representation of all the texts took up approximately 98% of the
space occupied by the original text. Intermediate (but unretained) files generated in CLARIT
processing included a file of the words in each text, in their original order, annotated with
morphological categories. Other files contained the output of the parser as a list of NPs in the
order in which they occurred in each text. The parsed representation of the text was retained
and used at all subsequent steps of processing. Indeed, hereafter, unless otherwise specified,
any reference to a document or collection of documents refers to the CLARIT representation
of the text, viz., a sequence of normalized simplex NPs.15
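The retained representation can be sketched as follows. The field names, the sample phrases, and the normalization step (lowercasing and whitespace collapsing) are assumptions standing in for CLARIT's actual normalization, which the text does not specify in detail.

```python
# Minimal sketch: a document reduced to an ordered sequence of normalized
# simplex NPs. The "docno" identifier and the normalization used here are
# illustrative assumptions, not the CLARIT internals.

def normalize_np(np):
    """Lowercase and collapse whitespace; a stand-in for NP normalization."""
    return " ".join(np.lower().split())

doc = {
    "docno": "WSJ-EXAMPLE",  # hypothetical document identifier
    "nps": [normalize_np(np) for np in
            ["Italian State", "Holding Company",
             "Computerized  Industrial Control System"]],
}
print(doc["nps"])
```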
4.3 Identifying Terms from Topics
All fields of topic statements, such as the one given in Figure 9, were similarly processed for NPs.
Team members reviewed the NPs and assigned weights of "1", "2", or "3" to each NP according
to whether the term was central or peripheral to the topic. (Some extracted NPs were discarded
as irrelevant or ill-formed; the vast majority were retained.) A sample set of weighted terms
for the topic in Figure 9 is given in Figure 10. The manual review and weighting of terms from
the topic statement took less than 5 minutes per topic. All subsequent processing of the query
was performed automatically.
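The weighting step above can be pictured with a small sketch. The paper does not state which end of the 1–3 scale marks a central term, so the orientation below (3 = central) is an assumption, as are the sample terms.

```python
# Hedged illustration of manual topic-term weighting: each retained NP gets
# a weight of 1, 2, or 3. Orientation (3 = central) and terms are assumed.

weighted_terms = {
    "industrial control system": 3,  # central to the topic (assumed)
    "holding company": 2,
    "annual report": 1,              # peripheral (assumed)
}

# NPs judged irrelevant or ill-formed are simply not entered.
for term, w in sorted(weighted_terms.items(), key=lambda kv: -kv[1]):
    print(w, term)
```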
4.4 Establishing Sets of `Relevant' Documents
Given the need to `evoke' candidate documents and to `partition' the database into subsets
that were easier to manage, we were naturally interested in identifying features in the topics
that would be useful as discriminators. We had little confidence, however, that the specific
terms in topics, which constitute the "source query", were either most representative of the
domain of the topic (= the `satisfaction class') or reasonably comprehensive. We thus decided
to supplement the source query with additional terms.
In particular, we used the CLARIT thesaurus-discovery technique on known relevant
documents to identify terminology that might better represent the satisfaction-class
documents than the source query alone. The process produced a list of terms from the
available topic-relevant documents (or from a small sample of relevant documents that we may
have found) and automatically nominated the top-ranked terms (approximately 20%) to
supplement the original query (as derived from the topic statement), producing a
"routing/partitioning thesaurus" for the topic.
Since the routing topics already had accompanying relevant documents, we used these as
a source of additional terminology. Ad-hoc queries, on the other hand, had no associated
relevant documents, so we designed a preliminary, partial `retrieval' step that would help us
15From the point of view of the CLARIT system, the information in a document is entirely represented by the
extracted noun phrases.