NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
4. Construction of Query Vectors. The techniques used to compose query vectors from sample documents are problematic. For example, all terms from the contributing documents were included in the query vector, which clearly contributes `noise' to the final vector.
5. Scoring of Query Vectors. The terms in the query vector were scored independently within their source documents, and the scores were then combined by addition. This has the desirable effect of reinforcing terms that occur in several contributing documents, but it is a crude mechanism for doing so.
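Both concerns can be illustrated with a small sketch (hypothetical code, not the CLARIT implementation; raw term counts stand in for the per-document term scores):

```python
from collections import Counter

def build_query_vector(sample_docs):
    """Compose a query vector from sample documents by scoring terms
    independently per document, then combining the scores by addition.

    Hypothetical sketch: raw counts stand in for per-document term
    scores. Every term from every contributing document survives into
    the final vector, which is the source of the 'noise' noted above;
    terms shared by several documents are reinforced because their
    per-document scores add up.
    """
    query = Counter()
    for doc in sample_docs:
        query.update(Counter(doc.lower().split()))
    return query

docs = ["text retrieval systems",
        "evaluating text retrieval",
        "text indexing noise"]
vec = build_query_vector(docs)
# "text" appears in all three documents and so dominates the vector,
# while the stray term "noise" is carried along unfiltered.
```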
Improvements can be made in many elements of the CLARIT-TREC system. It is clear
that we require further experimentation and analysis to evaluate the system.
7 Conclusion
The CLARIT system has been and is continuing to be developed as part of a university research
project. The specific configuration of the system used in the TREC experiments was developed
in less than one week. As a research prototype, the system has not been engineered for optimal
performance.
The TREC task was challenging, in part, because of the size of the corpus. The CLARIT-
TREC team had not previously worked with such a large database; we `invented' solutions to
many engineering problems on the fly. We were often inefficient.
Many text processing functions that are available in the CLARIT system or are near completion were not used on TREC documents. In future evaluations, we plan to utilize some of the more sophisticated functionality in the system. For example, we have been developing grammars for recognizing complex tokens such as proper names, dates, times, monetary values, etc., but did not use token recognition modules in CLARIT-TREC processing. We believe that such token recognition would greatly improve the results for queries involving specific persons or time intervals. In addition, we believe that it will be possible to improve results by taking advantage of sub-document scoring. By dividing a long document into smaller, multi-paragraph units, we will be able to score documents more accurately with respect to a topic. Finally, we have also been experimenting with generating sub-corpus-derived equivalence classes for words and terms. Equivalence classes will make it possible to expand query terms precisely and selectively.
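The last two ideas can be sketched roughly as follows (hypothetical code, not the CLARIT implementation; term counting stands in for the actual scoring, and the equivalence classes are invented rather than sub-corpus-derived):

```python
def score_by_subdocuments(document, query_terms, unit_size=2):
    """Split a document into multi-paragraph units and score each unit
    against the query; the best unit's score becomes the document
    score, so one on-topic passage is not diluted by a long document."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    units = [" ".join(paragraphs[i:i + unit_size])
             for i in range(0, len(paragraphs), unit_size)]

    def unit_score(unit):
        words = unit.lower().split()
        return sum(words.count(t) for t in query_terms)

    return max((unit_score(u) for u in units), default=0)

def expand_query(query_terms, equivalence_classes):
    """Expand each query term with the members of its equivalence
    class (the classes here are illustrative placeholders)."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(equivalence_classes.get(term, ()))
    return expanded

doc = "stock prices fell\n\nthe market dropped sharply\n\nweather was mild"
terms = expand_query({"market"}, {"market": {"stock", "exchange"}})
score = score_by_subdocuments(doc, terms)
# The first two-paragraph unit matches both "market" and the
# expansion term "stock"; the final paragraph contributes nothing.
```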
Clearly, there is a great deal more to be learned from the TREC experiment. In our continuing analyses, we will attempt to parameterize CLARIT performance and to experiment with extensions of CLARIT functionality that may result in superior retrieval.
8 Acknowledgements
CLARIT team participation in the TREC activities was made possible, in part, by grants from DARPA/NIST and from the Digital Equipment Corporation. In addition, several groups at Carnegie Mellon University (including the Laboratory for Computational Linguistics, the University Libraries, and the School of Computer Science) provided resources to the CLARIT team. All the groups that supported our effort have our sincerest thanks and appreciation.