SP500215: NIST Special Publication 500-215, The Second Text REtrieval Conference (TREC-2), D. K. Harman (Ed.), National Institute of Standards and Technology.

Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd and J. Driscoll

If the word "trains" is in the Query and the word "leaves" is in the Document, and we look at the semantic category Motion with Reference to Direction (AMDR), then one of the vector product elements in the formula becomes:

    probability("leaves" triggers AMDR)

where the probabilities are obtained from our semantic lexicon.

Queryid (Num): 47 of 50
Total number of documents over all queries
    Retrieved: 36610
    Relevant:  2064
    Rel_ret:   913
Interpolated Recall - Precision Averages:
    at 0.00    0.3514
    at 0.10    0.1968
    at 0.20    0.1367
    at 0.30    0.1082
    at 0.40    0.0894
    at 0.50    0.0752
    at 0.60    0.0276
    at 0.70    0.0105
    at 0.80    0.0062
    at 0.90    0.0013
    at 1.00    0.0007
Average precision (non-interpolated) over all rel docs: 0.0746
Precision:
    At    5 docs: 0.1660
    At   10 docs: 0.1532
    At   15 docs: 0.1433
    At   20 docs: 0.1298
    At   30 docs: 0.1057
    At  100 docs: 0.0643
    At  200 docs: 0.0465
    At  500 docs: 0.0302
    At 1000 docs: 0.0194
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact: 0.1035

Figure 16. Filtering Using Keywords.

Queryid (Num): 47 of 50
Total number of documents over all queries
    Retrieved: 36383
    Relevant:  2064
    Rel_ret:   956
Interpolated Recall - Precision Averages:
    at 0.00    0.3961
    at 0.10    0.2479
    at 0.20    0.1734
    at 0.30    0.1258
    at 0.40    0.1067
    at 0.50    0.0838
    at 0.60    0.0372
    at 0.70    0.0195
    at 0.80    0.0100
    at 0.90    0.0029
    at 1.00    0.0009
Average precision (non-interpolated) over all rel docs: 0.0919
Precision:
    At    5 docs: 0.2426
    At   10 docs: 0.2149
    At   15 docs: 0.1801
    At   20 docs: 0.1574
    At   30 docs: 0.1383
    At  100 docs: 0.0745
    At  200 docs: 0.0522
    At  500 docs: 0.0320
    At 1000 docs: 0.0203
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact: 0.1283
Figure 17. Filtering Using Semantic Categories.

We plan to do more experiments incorporating the following improvements:

a. Modernize the semantic lexicon. Since our lexicon is based on the 1911 version of Roget's Thesaurus, many modern words are not present and the senses of recorded words are not accurate. We plan to correct this; for example, we could try to get permission to use the current version of Roget's Thesaurus.

b. Base similarity on paragraphs instead of whole documents. We have had success using as few as 36 categories in a paragraph environment. We also feel that relevance decisions are made by humans looking at roughly a paragraph of information. We plan to modify our code to use paragraphs as a basis for the similarity measure.

c. Experiment with the number of possible semantic categories and the probability assigned to a triggered category. The experiment behind the performance improvement shown in Figure 16 and Figure 17 uses a very fine-grained set of semantic categories and treats the triggered semantic categories for a word uniformly. We plan to experiment with fewer categories, and we plan to obtain a probability distribution for categories based on word usage.

Basically, we are trying to establish a statistically sound approach to using word sense information. Our intuition is that word sense information should improve retrieval performance. Furthermore, our approach to using word sense information has shown a significant performance improvement in a question/answer environment where paragraphs represent documents. We feel that other word sense approaches, such as query expansion or word sense disambiguation, may not be statistically sound, and that may be why successful experiments have not been reported.
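The trigger-probability idea behind the vector product element above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' code: the lexicon entries, the category PLANT_PART and TEACHING names, the probability values, and the function names are all invented for illustration; only AMDR and the "trains"/"leaves" example come from the text.

```python
# Hypothetical sketch: each word maps to the Roget-style semantic categories it
# can trigger, with a trigger probability per category from a semantic lexicon.
# All entries below are made up for illustration.
LEXICON = {
    "leaves": {"AMDR": 0.5, "PLANT_PART": 0.5},   # motion sense vs. foliage sense
    "trains": {"AMDR": 0.6, "TEACHING": 0.4},
    "stops":  {"AMDR": 1.0},
}

def category_vector(words):
    """Accumulate trigger probabilities into a category-indexed vector."""
    vec = {}
    for w in words:
        for cat, p in LEXICON.get(w, {}).items():
            vec[cat] = vec.get(cat, 0.0) + p
    return vec

def similarity(query_words, doc_words):
    """Inner product over the semantic categories shared by query and document."""
    q, d = category_vector(query_words), category_vector(doc_words)
    return sum(q[cat] * d[cat] for cat in q if cat in d)

# "trains" in the query and "leaves" in the document can both trigger AMDR,
# so the AMDR element contributes 0.6 * 0.5 = 0.3 to the vector product.
print(similarity(["trains"], ["leaves"]))  # -> 0.3
```

The key point the example mirrors is that matching happens in category space: "trains" and "leaves" share no keyword, yet they still contribute to the similarity score through the AMDR category they can both trigger.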