NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd
J. Driscoll
Considering Figure 12, the word "depart" occurs in the
query one time and triggers the category AMDR. The word
"leave" occurs in Document #4 once and also triggers the
category AMDR. Thus, item 1 in Figure 12 corresponds to
subsection (a) as described above. An example using sub-
section (b) occurs in item 14 of Figure 12.
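To make the Step 1 pairing concrete, here is a minimal Python sketch of one way the list of (First Entry, Second Entry, Third Entry) items could be built; the function and the triggers() lookup are hypothetical stand-ins for our lexicon access, not the actual implementation:

    # Hypothetical sketch of Step 1: a query word and a document word are
    # listed together when they match exactly or when they both trigger a
    # shared semantic category (e.g., "depart" and "leave" both trigger AMDR).
    def step1_pairs(query_words, doc_words, triggers):
        items = []
        for q in query_words:
            for d in doc_words:
                if q == d:
                    items.append((q, d, None))   # exact keyword match, no category
                for cat in triggers(q) & triggers(d):
                    items.append((q, d, cat))    # e.g., ("depart", "leave", "AMDR")
        return items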
Step 2.
This step adjusts for words in the query that are not in any
of the documents. Figure 13 shows the output of Step 2 for
Document #4. In this step, another list is created from the list
created in Step 1. For each item in the Step 1 list which has
a word with an undefined idf, this step replaces the word in the
First Entry column by the word in the Second Entry column.
For example, the word "depart" has an undefined idf as shown
in Figure 8. Thus, the word "depart" in item 1 of Figure 12
should be replaced by the word "leave" from the Second Entry
column. This is shown in item 1 of Figure 13. Likewise, the
words "do" and "when" also have an undefined idf and are
respectively replaced by the words from the Second Entry
column.
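As a minimal Python sketch of this substitution, assume idf is a mapping from words to idf values in which a missing key represents an undefined idf:

    # Sketch of Step 2: when the First Entry word has an undefined idf,
    # substitute the Second Entry word so a defined idf is available.
    def step2(items, idf):
        adjusted = []
        for first, second, cat in items:
            if first not in idf:
                first = second              # e.g., "depart" becomes "leave"
            adjusted.append((first, second, cat))
        return adjusted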
Step 3.
This step calculates the weight of a semantic component
in the query and calculates the weight of a semantic compo-
nent in the document. Figure 14 shows the output of Step 3
for Document #4. In Step 3, another list is created from the
list created in Step 2 as follows:
For each item in the Step 2 list, follow either subsection (a)
or (b), whichever applies:
a. If the Third Entry specifies a category, then
1) Replace the First Entry by computing:
(idf of word in First Entry) × (frequency of word in First Entry) × (probability the word triggers the category in the Third Entry)
2) Replace the Second Entry by computing:
(idf of word in Second Entry) × (frequency of word in Second Entry) × (probability the word triggers the category in the Third Entry)
3) Omit the Third Entry.
b. If the Third Entry does not specify a category, then
1) Replace the First Entry by computing:
(idf of word in First Entry) × (frequency of word in First Entry)
2) Replace the Second Entry by computing:
(idf of word in Second Entry) × (frequency of word in Second Entry)
3) Omit the Third Entry.
In Figure 14, item 1 is an example of using subsection (a),
and item 14 is an example of using subsection (b).
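The computation can be sketched in Python as follows; freq_q and freq_d are assumed frequency lookups for the query and the document, and p(word, cat) stands in for the probability that a word triggers a category:

    # Sketch of Step 3: each item becomes a pair of weights, one for the
    # query side (First Entry) and one for the document side (Second Entry);
    # the category (Third Entry) is folded into the weights and then omitted.
    # For a word substituted in Step 2, we assume freq_q still reflects the
    # original query word's occurrence count.
    def step3(items, idf, freq_q, freq_d, p):
        weights = []
        for first, second, cat in items:
            wq = idf[first] * freq_q[first]
            wd = idf[second] * freq_d[second]
            if cat is not None:             # subsection (a): category match
                wq *= p(first, cat)
                wd *= p(second, cat)
            weights.append((wq, wd))        # subsection (b) falls through as-is
        return weights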
Step 4.
This step multiplies the weights in the query by the weights
in the document. The top portion of Figure 15 shows the
output of Step 4. In the list created here, the numerical value
created in the First Entry column of Figure 14 is multiplied
by the numerical value created in the Second Entry column
of Figure 14.
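In a Python sketch consistent with the ones above, this step is an elementwise product over the Step 3 pairs:

    # Sketch of Step 4: multiply each query-side weight by its
    # document-side counterpart.
    def step4(weights):
        return [wq * wd for wq, wd in weights]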
Step 5.
This step sums the values in the Step 4 list to compute the
semantic similarity coefficient for a particular document. The
bottom portion of Figure 15 shows the output of Step 5 for
Document #4.
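Completing the sketch, Step 5 reduces the Step 4 list to a single number:

    # Sketch of Step 5: the semantic similarity coefficient for the
    # document is the sum of the Step 4 products.
    def step5(products):
        return sum(products)

Chaining the five sketches for a query and a document would then yield the coefficient shown in the bottom portion of Figure 15.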
We have finally observed an improved Precision/Recall
performance using the semantic similarity coefficient
explained here. For example, in a Category B filtering
experiment where the words being considered were only those
in the topics and idf values were determined by the number
of topics a word is in, we have observed the keyword and
semantic results shown in Figure 16 and Figure 17, respec-
tively. The 11-pt average for these two experiments reveals
a 23% increase due to the use of semantic categories.
According to Sparck Jones' criteria, this change would be
classified as "significant" (greater than 10.0%) [12]. We
believe further improvement is possible by considering more
words, stemming for plurals and tenses of words, better idf
values (like those used for archival retrieval), a modern
lexicon, and a focus on paragraphs instead of whole docu-
ments.
5. Summary
Our progress during TREC-1 and TREC-2 has been the
following:
a. We created efficient code for a UNIX platform. Originally
our code used B+ tree structures for implementing inverted
files on a DOS platform. We now use hashing to replace
B+ trees, establishing codes to replace character strings,
and the UNIX platform provides faster processing than the
DOS platform.
b. We built an index for a semantic lexicon based on the public
domain 1911 version of Roget's Thesaurus. To do this,
we had to create our own category numbering system
similar to today's version of Roget's Thesaurus.
c. We solved part of the blend problem for semantic and
keyword weights. We now base semantic category weights
on the idf of words which generate the semantic categories.
We can now index or scan TREC documents at rates faster
than 60 Megabytes per hour depending on the workstation.
We have a semantic lexicon of approximately 20,000 words
with flexible category codes that allow a coarse (36 catego-
ries) through fine (more than 15,000 categories) semantic
analysis. As shown in Section 4, our procedure for
determining relevance is based on the senses of each word.
For example, using the vector processing model and the
similarity coefficient

sim(Q, D_j) = \sum_{i=1}^{t} w_{q,i} \cdot w_{d_j,i},