NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd
J. Driscoll
National Institute of Standards and Technology
D. K. Harman
The semantic categories in our example are those shown
in Figure 1. For example, consider the word "depart" which
occurs one time in the query as shown in Figure 9. The
semantic lexicon entry for the word "depart" using the
categories of Figure 1 is as follows:
depart: NONE NONE NONE NONE NONE AMDR AMDR TACM
where NONE represents a word sense not included in the 36
semantic categories of Figure 1. If a uniform distribution is
assumed, then AMDR is triggered 1/4 of the time and TACM
is triggered 1/8 of the time. This is shown in Figure 9 as the
probabilities for each semantic category.
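As a minimal sketch, the uniform-distribution calculation above can be reproduced directly from a lexicon entry. The list-of-senses data structure is an assumption for illustration, and the eighth sense of "depart" is read as TACM from a partly garbled code in the source:

```python
from collections import Counter
from fractions import Fraction

def category_probabilities(senses):
    # Under a uniform distribution over a word's senses, the probability
    # that the word triggers a category is (number of senses tagged with
    # that category) / (total number of senses).  NONE marks a sense
    # outside the 36 categories and triggers nothing.
    counts = Counter(s for s in senses if s != "NONE")
    return {cat: Fraction(n, len(senses)) for cat, n in counts.items()}

# Lexicon entry for "depart" (eighth sense assumed to be TACM):
depart = ["NONE", "NONE", "NONE", "NONE", "NONE", "AMDR", "AMDR", "TACM"]
print(category_probabilities(depart))  # AMDR -> 1/4, TACM -> 1/8
```

With two of eight senses tagged AMDR and one tagged TACM, this yields the 1/4 and 1/8 probabilities shown in Figure 9.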
A similar category probability determination is done for
each document. Figure 10 is an alphabetized list of all the
unique words in Document #4 of Figure 6. The semantic
categories each word triggers along with probabilities are also
shown.
The text relevance determination procedure is shown in
Figure 11. The procedure uses three input lists:
a. List of words and the kif of each word, as shown in Figure
8.
b. List of words in the query and the semantic categories they
trigger along with the probability of triggering those
categories, as shown in Figure 9.
c. List of words in a document and the semantic categories
they trigger along with the probability of triggering those
categories, as shown in Figure 10.
The procedure operates as follows:
Step 1.
This step determines the common meanings between the
query and the document. Figure 12 corresponds to the output
of Step 1 for Document #4. In Step 1, a new list is created as
follows:
For each word in the query, follow either subsection (a) or
(b), whichever applies:
a. For each category the word triggers, find each word in the
document that triggers the category and output three things:
1) The word in the query and its frequency of occurrence.
2) The word in the document and its frequency of
occurrence.
3) The category.
b. If the word does not trigger a category, then look for the
word in the document and, if found, output two things (the
category field is left blank):
1) The word in the query and its frequency of occurrence.
2) The word in the document and its frequency of
occurrence.
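The two match cases of Step 1 can be sketched as follows. The dictionary shapes for the query and document lists are assumptions for illustration, not the paper's actual data structures, and the TACM code in the sample data is an assumed reading of a partly garbled category code:

```python
def step1_common_meanings(query, document):
    # `query` and `document` map each word to (frequency, {category: prob}).
    # An empty category dict means the word triggers none of the 36 categories.
    rows = []
    for q_word, (q_freq, q_cats) in query.items():
        if q_cats:
            # (a) For each category the query word triggers, pair it with
            # every document word that triggers the same category.
            for cat in q_cats:
                for d_word, (d_freq, d_cats) in document.items():
                    if cat in d_cats:
                        rows.append((q_word, q_freq, d_word, d_freq, cat))
        elif q_word in document:
            # (b) No category: match on the word itself, category left blank.
            rows.append((q_word, q_freq, q_word, document[q_word][0], None))
    return rows

# Fragment of the Figure 9 / Figure 10 data:
query = {"depart": (1, {"AMDR": 1/4, "TACM": 1/8}), "the": (1, {})}
doc4 = {"leave": (1, {"AMDR": 1/7, "TACM": 1/7}),
        "the": (1, {}),
        "until": (1, {"THM": 1.0})}
for row in step1_common_meanings(query, doc4):
    print(row)
```

Here "depart" pairs with "leave" through both AMDR and TACM, and "the", which triggers no category, matches itself directly.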
word      frequency  category     probability
hourly    1          THM          1.0
leave     1          AMDR         1/7
                     TACM         1/7
noon      1          AU)M         1/3
                     THM          2/3
the       1
station   1          APOS         3/16
                     AORD         1/8
                     TACM         1/16
                     TCNP         1/8
                     THGR         1/16
                     J[OCRerr]PL  3/16
trains    1          AORD         7/24
                     AMDR         1/12
                     AMFR         1/12
                     TACM         1/24
                     TCNV         1/12
until     1          THM          1.0

Figure 10. Words in Document #4.
Step 1 - Refer to Figure 12.
Determine common meaning
between query and the document.
Step 2 - Refer to Figure 13.
Adjust for words in the
query that are not in any
of the documents.
Step 3 - Refer to Figure 14.
Calculate the weight of a
semantic component in the query
and calculate the weight of a
semantic component in the document.
Step 4 - Refer to Figure 15.
Multiply the weight in the query
by the weight in the document.
Step 5 - Refer to Figure 15.
Sum all the individual products
of Step 4 into a single value which
is the semantic similarity coefficient.
Figure 11. Relevance Determination Procedure to Explain
Semantic Similarity.
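Steps 4 and 5 of the procedure reduce to an inner product over the semantic components common to the query and the document. The weight formulas of Steps 2 and 3 are given in Figures 13 and 14 and are not reproduced here, so the component weights below are placeholder values; only the multiply-and-sum is shown:

```python
def semantic_similarity(query_weights, doc_weights):
    # Step 4: multiply the query weight by the document weight for each
    # semantic component common to both.  Step 5: sum the products into
    # a single semantic similarity coefficient.
    return sum(q_w * doc_weights[c]
               for c, q_w in query_weights.items()
               if c in doc_weights)

# Placeholder component weights (the real ones come from Steps 2-3):
q_w = {"AMDR": 0.25, "TACM": 0.125}
d_w = {"AMDR": 0.5, "TACM": 0.25, "THM": 1.0}
print(semantic_similarity(q_w, d_w))  # 0.25*0.5 + 0.125*0.25 = 0.15625
```

Components that appear in only one of the two lists (THM above) contribute nothing, mirroring the fact that Step 1 only carries forward meanings common to both query and document.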