NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
The QA System
J. Driscoll, J. Lautenschlager, M. Zhao
National Institute of Standards and Technology
Donna K. Harman

For the TREC experiments, we used nine IBM PS/2 Model 95 computers. These were 50 MHz 486 computers, each with 8 megabytes of RAM and one 400 megabyte hard drive. Two of the machines had 16 megabytes of RAM, and another had two 400 megabyte hard drives. A 33 MHz 486 PC was used to distribute text to the nine IBM machines and then to collect information for merging and redistribution.

Our plan was to put each of the nine Vol. 1 and Vol. 2 document collections on a separate machine, determine the document frequency for each term tj on each machine, and then merge the results and determine the inverse document frequency for each term in the entire collection. Next, we would index and query each document collection separately, and finally merge the retrieval results.

The vector processing model is the basis for our approach. Terms used as document identifiers are keywords modified by various techniques such as stop lists, stemming, synonyms, and query reformulation. The weighting factor (w) for a term in a document is computed from term frequency (tf), document frequency (df), and inverse document frequency (idf). The basic definitions are as follows:

    tf_ij = number of occurrences of term tj in document Di
    df_j  = number of documents in the collection which contain tj
    idf_j = log(N / df_j), where N = total number of documents
    w_ij  = tf_ij * idf_j

When the system is used to query a collection of documents with t terms, the system computes a vector Q equal to (w_q1, w_q2, ..., w_qt) representing the weights for each term in the query.
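The tf*idf weighting defined above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes documents are already tokenized lists of terms, and the function name is our own.

```python
import math
from collections import Counter

def build_weights(docs):
    """Compute w_ij = tf_ij * idf_j for every term in every document.

    docs: list of documents, each a list of term strings.
    Returns (weights, idf), where weights[i] maps term -> w_ij
    and idf maps term -> log(N / df).
    """
    N = len(docs)
    # df_j: number of documents containing term tj
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # idf_j = log(N / df_j)
    idf = {t: math.log(N / df[t]) for t in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of tj in Di
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights, idf
```

Note that a term occurring in every document receives idf = log(1) = 0 and therefore zero weight, which is why common words contribute nothing under this scheme.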
The retrieval of a document with vector Di equal to (d_i1, d_i2, ..., d_it), representing the weights of each term in the document, is based on the value of a similarity measure between the query vector and the document vector. For the NIST experiments, we used the Jaccard similarity coefficient [7] to retrieve documents for the September deadline.

Jaccard Coefficient:

    sim(Q, Di) = sum_{j=1..t} w_qj * d_ij / ( sum_{j=1..t} (d_ij)^2 + sum_{j=1..t} (w_qj)^2 - sum_{j=1..t} w_qj * d_ij )

3. Semantic Approach

Although the basic IR approach has shown some success in regard to natural language queries, it ignores some valuable information. For instance, consider the following query:

    How long does the payload crew go through training before a launch?

The typical IR system dismisses the following words in the query as useless or empty: "how", "does", "the", "through", "before", and "a". Some of these words contain valuable semantic information. The following list indicates some of the semantic information triggered by a few of these words:

    how long    Duration, Time
    through     Location/Space, Motion with Reference to Direction, Time
    before      Location/Space, Time

The database concept of semantic modeling and the linguistic concept of thematic roles can help in this regard.

3.1 Semantic Modeling

Semantic modeling was an object of considerable database research in the late 1970's and early 1980's. A brief overview can be found in [2]. Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. These concepts included the two notions of entities (objects in the real world) and relationships among entities (actions in the real world). Both entities and relationships have properties. The properties of entities are often called attributes. There are basic or surface level attributes for entities in the real world. Examples of surface level entity attributes are General Dimensions, Color, and Position.
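The Jaccard coefficient above can be sketched directly from its definition. This is a generic illustration, not the NIST code; it assumes the query and document vectors are dictionaries mapping terms to weights, so terms absent from one vector contribute zero.

```python
def jaccard_sim(q, d):
    """Jaccard similarity between a query vector q and a document
    vector d, each a dict mapping term -> weight.

    sim = dot(q, d) / (|q|^2 + |d|^2 - dot(q, d))
    """
    terms = set(q) | set(d)
    dot = sum(q.get(t, 0.0) * d.get(t, 0.0) for t in terms)
    q2 = sum(w * w for w in q.values())
    d2 = sum(w * w for w in d.values())
    denom = q2 + d2 - dot
    return dot / denom if denom else 0.0
```

Unlike the plain cosine measure, the denominator here grows with the total weight mass of both vectors, so a document matching only a small fraction of a long query is penalized accordingly; identical nonzero vectors score exactly 1.0.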
These properties are prevalent in natural language. For example, consider the phrase "large, black book on the table", which indicates the General Dimensions, Color, and Position of the book. In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic roles are also referred to in the literature as participant roles, semantic roles, and case roles. Examples of thematic roles are Beneficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchase for Mary on Wednesday", which indicates who benefited from a purchase (Beneficiary) and when a purchase occurred (Time). The goal of our approach is to detect thematic information along with attribute information contained in natural language queries and documents. When the information is present, our system uses it to help find the most relevant document. In order to use this additional information, the basic underlying concept of text relevance as presented earlier needs to be modified. The major modifications include the addition of a lexicon with thematic and attribute information, and a modified computation of the similarity coefficient. We now discuss our semantic lexicon.

3.2 The Semantic Lexicon

Our system uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [6]. For our research, we have selected several classes from this hierarchy to be used for semantic categories. We have defined thirty-six semantic categories as shown in Figure 1.
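The idea of a lexicon that maps trigger words to semantic categories can be sketched as a simple phrase lookup. The table below contains only the three example triggers from the query discussed in Section 3; the real lexicon draws thirty-six categories from Roget's hierarchy, and the function name here is our own.

```python
# Illustrative fragment of a semantic lexicon: trigger phrase -> categories.
# Entries are taken from the worked example in the text; the full lexicon
# would cover all thirty-six categories.
TRIGGERS = {
    "how long": ["Duration", "Time"],
    "through": ["Location/Space", "Motion with Reference to Direction", "Time"],
    "before": ["Location/Space", "Time"],
}

def semantic_categories(query):
    """Return the semantic categories triggered by phrases in a query,
    in first-seen order and without duplicates."""
    text = query.lower()
    found = []
    for phrase, cats in TRIGGERS.items():
        if phrase in text:
            found.extend(c for c in cats if c not in found)
    return found
```

A real system would tokenize rather than use substring matching (so that "before" does not fire inside "beforehand"), but this suffices to show how stop words that plain IR discards can still contribute category evidence.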