NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
2.3 Vector-Space `Similarity' Measures
The principal method used by the CLARIT system in comparing `information objects' (e.g.,
in retrieval, in routing) is vector-space distance.10 The basic metric is that of `similarity' of
terms. `Similarity' is determined by different procedures in different contexts. Partial or `fuzzy'
matching of terms is facilitated by noting whether terms share words or attested subphrases.
For example, in vector-space modeling of documents, the contained words of all terms (in the
document vector as well as the query vector) are broken out, giving, in effect, the possibility of
matching parts of terms, though, technically, the individual words are realized as independent
dimensions of the term space.11
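The decomposition described above can be sketched in Python. This is a hedged illustration of the general idea, not the CLARIT implementation: multi-word terms are broken into their constituent words, each word becomes an independent dimension, and two vectors can then partially match on shared words even when no full term is shared. All function names here are invented for the example.

```python
# Sketch of fuzzy term matching via word decomposition (illustrative only):
# each word of every multi-word term becomes an independent dimension of
# the vector space, enabling partial matches between terms.
import math
from collections import Counter

def decompose(terms):
    """Map a list of (possibly multi-word) terms to word-dimension counts."""
    counts = Counter()
    for term in terms:
        for word in term.lower().split():
            counts[word] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dims = set(u) | set(v)
    dot = sum(u[d] * v[d] for d in dims)   # Counter returns 0 for missing keys
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc = decompose(["text retrieval", "vector space model"])
query = decompose(["information retrieval"])
sim = cosine(doc, query)  # nonzero: the word "retrieval" is shared
```

Note that `text retrieval` and `information retrieval` match partially here only because the shared word `retrieval` is a dimension of its own; a space indexed on whole terms would score this pair zero.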
2.4 Notes on the Limited Version of CLARIT Processing in TREC
Because of the time and space limitations in the task, the CLARIT team did not utilize several
features of CLARIT processing that normally produce enhanced results. One of the features-
the automatic `tokenization' or identification of proper names-would certainly have assisted
processing of some topics. Another featur[OCRerr]the identification of equivalence classes of terms-
also would have aided the task.
In addition, no attempt was made to establish `uniform-length' documents or sub-documents
(e.g., by setting a maximum word count or sentence length for such units). Though CLARIT
processing supports the treatment of documents as sub-document collections, that feature of
CLARIT processing was not utilized in the experiments.
All topic statements were treated uniformly and simply: no attempt was made to handle
implicit or explicit quantification, time intervals, satisfaction conditions, etc., except as literally
encoded in the topics.
Though CLARIT NLP modules can produce full sentence analyses or complex-NP analyses,
neither of these features was utilized in TREC processing. All documents were processed only
for simplex NPs; inevitably, some non-NP information was lost.
In indexing TREC documents, term weights were based on a general IDF-TF score12 for
topic `domains'. In the case of multi-word terms (the norm), the full term was assigned an
independent IDF-TF score, and each word in the term was broken out and assigned its own
independent IDF-TF score.
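The dual weighting of full terms and their constituent words can be sketched as follows. This is a minimal illustration under assumed conventions (a plain tf x log(N/df) weight and invented function names), not the actual CLARIT scoring:

```python
# Illustrative IDF-TF weighting: the full (often multi-word) term receives
# its own score, and each constituent word is broken out and scored
# independently. The weighting formula tf * log(N/df) is an assumption.
import math
from collections import Counter

def idf_tf_weights(doc_terms, collection):
    """doc_terms: terms of one document; collection: list of term lists."""
    n_docs = len(collection)

    def idf(unit, unit_sets):
        df = sum(1 for s in unit_sets if unit in s)
        return math.log(n_docs / df) if df else 0.0

    term_sets = [set(d) for d in collection]
    word_sets = [{w for t in d for w in t.split()} for d in collection]

    weights = {}
    for term, tf in Counter(doc_terms).items():
        weights[term] = tf * idf(term, term_sets)   # full-term score
    for word, tf in Counter(w for t in doc_terms for w in t.split()).items():
        weights[word] = tf * idf(word, word_sets)   # constituent-word score
    return weights

docs = [["text retrieval"], ["vector space"], ["text retrieval", "vector space"]]
w = idf_tf_weights(docs[0], docs)
# w contains a weight for the full term "text retrieval" and separate
# weights for the words "text" and "retrieval"
```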
While all CLARIT processing is designed to be fully automatic, we did not employ fully
automatic processing in TREC tasks. In particular, there were two steps in the CLARIT-
TREC process that required non-automatic processing: (1) initial review and weighting of the
index terms automatically nominated and derived from each topic statement and (2) review
of first-pass retrieved documents to identify 5-10 relevant ones for `feedback'. The two steps
involved minimal user intervention (and, in fact, required very little time and effort); however,
they do qualify the CLARIT-TREC system as a manual process.
In general, we regard the CLARIT-TREC system as a minimal system for purposes of
evaluation. The results of CLARIT-TREC processing are useful in helping us establish baseline
performance for core but abbreviated CLARIT functions.
10 Cf. [Salton & McGill 1983] for background on vector-space modeling in information retrieval applications.
11 Cf. [Evans et al. 1992] and [Hersh et al. 1992] for an evaluation of CLARIT vector-space `similarity' measures.
12"IDF-TF" represents the standard inver8e document frequenc[OCRerr] x intradocument term frequenc[OCRerr] score for
terms.