SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
specified.4 An altogether different situation arises
when the query actually requests that certain
underspecified information be found before a docu-
ment can be judged relevant. In TREC topics (e.g.,
058, as in many others) the following request is com-
monplace: to be relevant, the document will identify
the location of the strike or potential strike. What we
ask the system here is to extract the value of a certain
variable that satisfies certain conditions, i.e., find X
such that location-of-strike(X). It is impossible to
properly evaluate such a query using any kind of
constant-term-based retrieval. What is required, at a
minimum, is a general pattern matching capability
and an appropriately advanced representation of con-
tents (including more careful NLP processes).
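The contrast can be sketched in a few lines of purely illustrative Python: constant-term retrieval can only test whether terms co-occur, while even a simple pattern with a variable slot can bind X in location-of-strike(X). The regular expression, document, and function names below are invented for illustration and are not part of the system described here.

```python
import re

def keyword_match(document, terms):
    # Constant-term retrieval: checks only that the terms co-occur.
    words = set(re.findall(r"\w+", document.lower()))
    return all(t in words for t in terms)

def extract_strike_location(document):
    # Pattern matching: binds the variable X in location-of-strike(X).
    # A toy pattern; a real system would need a far richer grammar.
    m = re.search(r"strike (?:at|in) ([A-Z][A-Za-z]*)", document)
    return m.group(1) if m else None

doc = "Workers staged a strike at Gdansk over unpaid wages."
print(keyword_match(doc, {"strike", "location"}))   # False: "location" never appears
print(extract_strike_location(doc))                 # Gdansk
```

A query asking for documents that "identify the location of the strike" fails under pure term matching (the word "location" need not occur), but the variable-binding pattern recovers the answer.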
In the remainder of this paper we discuss par-
ticulars of the present system and some of the obser-
vations made while processing TREC data. The
above comments will provide the background for
situating our present effort and the state of the art with
respect to where we should be in the future.
OVERALL DESIGN
Our information retrieval system consists of a
traditional statistical backbone (Harman and Candela,
1989) augmented with various natural language pro-
cessing components that assist the system in database
processing (stemming, indexing, word and phrase
clustering, selectional restrictions), and translate a
user's information request into an effective query.
This design is a careful compromise between purely
statistical non-linguistic approaches and those requir-
ing rather accomplished (and expensive) semantic
analysis of data, often referred to as `conceptual
retrieval'. The conceptual retrieval systems, though
quite effective, are not yet mature enough to be con-
sidered in serious information retrieval applications,
the major problems being their extreme inefficiency
and the need for manual encoding of domain
knowledge (Mauldin, 1991). However, as pointed out
in the previous section, a more careful text process-
ing may be required for certain types of requests.
In our system the database text is first pro-
cessed with a fast syntactic parser. Subsequently cer-
tain types of phrases are extracted from the parse
trees and used as compound indexing terms in addi-
tion to single-word terms. The extracted phrases are
statistically analyzed as syntactic contexts in order to
discover a variety of similarity links between smaller
subphrases and words occurring in them. A further
filtering process maps these similarity links onto
semantic relations (generalization, specialization,
synonymy, etc.) after which they are used to
transform the user's request into a search query.
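A minimal, hypothetical sketch of the indexing step just described: head-modifier pairs extracted from parse trees become compound indexing terms alongside single-word terms. The mock parse below merely stands in for TTP output, and the stopword list and term normalization are invented for illustration.

```python
from collections import Counter

# Mock regularized parse: tokens plus (head, modifier) pairs,
# standing in for the phrases extracted from TTP parse trees.
mock_parse = {
    "tokens": ["joint", "venture", "in", "china"],
    "pairs": [("venture", "joint")],  # (head, modifier) after regularization
}

def index_terms(parse):
    stopwords = {"in", "the", "of"}   # illustrative stopword list
    terms = Counter(t for t in parse["tokens"] if t not in stopwords)
    for head, mod in parse["pairs"]:
        terms[f"{mod} {head}"] += 1   # compound indexing term, e.g. "joint venture"
    return terms

print(index_terms(mock_parse))
```

The compound term "joint venture" is indexed in addition to the single-word terms, which is what later allows similarity links between subphrases and whole phrases to be computed.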
The user's natural language request is also
parsed, and all indexing terms occurring in it are
identified. Certain highly ambiguous, usually single-
word terms may be dropped, provided that they also
occur as elements in some compound terms. For
example, "natural" is deleted from a query already
containing "natural language" because "natural"
occurs in many unrelated contexts: "natural number",
"natural logarithm", "natural approach", etc. At the
same time, other terms may be added, namely those
which are linked to some query term through admis-
sible similarity relations. For example, "unlawful
activity" is added to a query (TREC topic 055) con-
taining the compound term "illegal activity" via a
synonymy link between "illegal" and "unlawful".
After the final query is constructed, the database
search follows, and a ranked list of documents is
returned.
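The two query-construction rules above can be sketched as follows. This is a hypothetical simplification: the real system derives its ambiguous-term and synonymy information statistically from the corpus, whereas here both are hard-coded stand-ins.

```python
# Illustrative stand-ins for corpus-derived knowledge:
AMBIGUOUS = {"natural"}             # highly ambiguous single-word terms
SYNONYMS = {"illegal": "unlawful"}  # admissible synonymy links

def build_query(terms):
    compounds = {t for t in terms if " " in t}
    query = set()
    for t in terms:
        # Drop an ambiguous single-word term if a compound already covers it.
        if t in AMBIGUOUS and any(t in c.split() for c in compounds):
            continue
        query.add(t)
    # Add terms reachable through admissible similarity relations.
    for c in list(query):
        for a, b in SYNONYMS.items():
            if a in c.split():
                query.add(c.replace(a, b))
    return query

print(build_query({"natural", "natural language", "illegal activity"}))
```

On the example from the text, "natural" is dropped (it occurs inside "natural language") and "unlawful activity" is added via the synonymy link from "illegal activity", as in TREC topic 055.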
It should be noted that all the processing steps,
those performed by the backbone system and those
performed by the natural language processing com-
ponents, are fully automated, and no human interven-
tion or manual encoding is required.
FAST PARSING WITH TTP PARSER
TTP (Tagged Text Parser) is based on the
Linguistic String Grammar developed by Sager
(1981). The parser currently encompasses some 400
grammar productions, but it is by no means complete.
The parser's output is a regularized parse tree
representation of each sentence, that is, a representa-
tion that reflects the sentence's logical predicate-
argument structure. For example, logical subject and
logical object are identified in both passive and active
sentences, and noun phrases are organized around
their head elements. The significance of this
representation will be discussed below. The parser is
equipped with a powerful skip-and-fit recovery
mechanism that allows it to operate effectively in the
face of ill-formed input or under severe time pres-
sure. In runs with approximately 83 million words
of TREC's Wall Street Journal texts,5 the parser's
speed averaged between 0.45 and 0.5 seconds per
sentence, or up to 2600 words per minute, on a 21
MIPS SparcStation ELC.
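As an illustration of what "regularized" means here, the toy sketch below maps an active and a passive surface form onto the same logical predicate-argument structure. It is not actual TTP output; the function, verb, and names are invented.

```python
def regularize(surface_subject, verb, surface_object, passive=False):
    # In a passive sentence ("Columbia was acquired by Sony") the
    # surface subject is the logical object, so the roles swap back.
    if passive:
        surface_subject, surface_object = surface_object, surface_subject
    return {
        "predicate": verb,
        "logical_subject": surface_subject,
        "logical_object": surface_object,
    }

# "Sony acquired Columbia" vs. "Columbia was acquired by Sony"
active = regularize("Sony", "acquire", "Columbia")
passive = regularize("Columbia", "acquire", "Sony", passive=True)
assert active == passive
print(active)
```

Because both surface forms reduce to the same structure, phrases extracted from parse trees match regardless of voice, which is what makes them usable as compound indexing terms.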
4 This does not mean that the query has to accurately reflect
the user's intentions. We take what we've got and give it our
best try.
5 Approximately 0.5 GBytes of text, over 4 million sen-
tences.