NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
2.3 Vector-Space `Similarity' Measures
The principal method used by the CLARIT system in comparing `information objects' (e.g.,
in retrieval, in routing) is vector-space distance.10 The basic metric is that of `similarity' of
terms. `Similarity' is determined by different procedures in different contexts. Partial or `fuzzy'
matching of terms is facilitated by noting whether terms share words or attested subphrases.
For example, in vector-space modeling of documents, the contained words of all terms (in the
document vector as well as the query vector) are broken out, giving, in effect, the possibility of
matching parts of terms, though, technically, the individual words are realized as independent
dimensions of the term space.11
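The decomposition described above can be sketched in Python. This is a hedged illustration of the general idea, not the CLARIT implementation: multi-word terms are broken into their constituent words, each word becomes an independent dimension, and two vectors can then partially match on shared words even when no full term is shared. All function names here are invented for the example.

```python
# Sketch of fuzzy term matching via word decomposition (illustrative only):
# each word of every multi-word term becomes an independent dimension of
# the vector space, enabling partial matches between terms.
import math
from collections import Counter

def decompose(terms):
    """Map a list of (possibly multi-word) terms to word-dimension counts."""
    counts = Counter()
    for term in terms:
        for word in term.lower().split():
            counts[word] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dims = set(u) | set(v)
    dot = sum(u[d] * v[d] for d in dims)   # Counter returns 0 for missing keys
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc = decompose(["text retrieval", "vector space model"])
query = decompose(["information retrieval"])
sim = cosine(doc, query)  # nonzero: the word "retrieval" is shared
```

Note that `text retrieval` and `information retrieval` match partially here only because the shared word `retrieval` is a dimension of its own; a space indexed on whole terms would score this pair zero.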
2.4 Notes on the Limited Version of CLARIT Processing in TREC
Because of the time and space limitations in the task, the CLARIT team did not utilize several
features of CLARIT processing that normally produce enhanced results. One of the features-
the automatic `tokenization' or identification of proper names-would certainly have assisted
processing of some topics. Another featur[OCRerr]the identification of equivalence classes of terms-
also would have aided the task.
In addition, no attempt was made to establish `uniform-length' documents or sub-documents
(e.g., by setting a maximum word count or sentence length for such units). Though CLARIT
processing supports the treatment of documents as sub-document collections, that feature of
CLARIT processing was not utilized in the experiments.
All topic statements were treated uniformly and simply: no attempt was made to handle
implicit or explicit quantification, time intervals, satisfaction conditions, etc., except as literally
encoded in the topics.
Though CLARIT NLP modules can produce full sentence analyses or complex-NP analyses,
neither of these features was utilized in TREC processing. All documents were processed only
for simplex NPs; inevitably, some non-NP information was lost.
In indexing TREC documents, term weights were based on a general IDF-TF score12 for
topic `domains'. In the case of multi-word terms (the norm), the full term was assigned an
independent IDF-TF score, and each word in the term was broken out and assigned its own
independent IDF-TF score.
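The dual weighting of full terms and their constituent words can be sketched as follows. This is a minimal illustration under assumed conventions (a plain tf x log(N/df) weight and invented function names), not the actual CLARIT scoring:

```python
# Illustrative IDF-TF weighting: the full (often multi-word) term receives
# its own score, and each constituent word is broken out and scored
# independently. The weighting formula tf * log(N/df) is an assumption.
import math
from collections import Counter

def idf_tf_weights(doc_terms, collection):
    """doc_terms: terms of one document; collection: list of term lists."""
    n_docs = len(collection)

    def idf(unit, unit_sets):
        df = sum(1 for s in unit_sets if unit in s)
        return math.log(n_docs / df) if df else 0.0

    term_sets = [set(d) for d in collection]
    word_sets = [{w for t in d for w in t.split()} for d in collection]

    weights = {}
    for term, tf in Counter(doc_terms).items():
        weights[term] = tf * idf(term, term_sets)   # full-term score
    for word, tf in Counter(w for t in doc_terms for w in t.split()).items():
        weights[word] = tf * idf(word, word_sets)   # constituent-word score
    return weights

docs = [["text retrieval"], ["vector space"], ["text retrieval", "vector space"]]
w = idf_tf_weights(docs[0], docs)
# w contains a weight for the full term "text retrieval" and separate
# weights for the words "text" and "retrieval"
```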
While all CLARIT processing is designed to be fully automatic, we did not employ fully
automatic processing in TREC tasks. In particular, there were two steps in the CLARIT-
TREC process that required non-automatic processing: (1) initial review and weighting of the
index terms automatically nominated and derived from each topic statement and (2) review
of first-pass retrieved documents to identify 5-10 relevant ones for `feedback'. The two steps
involved minimal user intervention (and, in fact, required very little time and effort); however,
they do qualify the CLARIT-TREC system as a manual process.
In general, we regard the CLARIT-TREC system as a minimal system for purposes of
evaluation. The results of CLARIT-TREC processing are useful in helping us establish baseline
performance for core but abbreviated CLARIT functions.
10 Cf. [Salton & McGill 1983] for background on vector-space modeling in information retrieval applications.
11 Cf. [Evans et al. 1992] and [Hersh et al. 1992] for an evaluation of CLARIT vector-space `similarity' measures.
12"IDF-TF" represents the standard inver8e document frequenc[OCRerr] x intradocument term frequenc[OCRerr] score for
terms.