[Figure 4: A Simple Network Consisting of Two Document Nodes and Three Index Term Nodes.]

receives information about the activation levels of its immediate neighboring nodes (nodes connected to it via direct links), and then uses this information to calculate its own activation level. Through this process of spreading activation, the network settles down to an equilibrium representing a retrieval for a user's information need. The computation of the information retrieval inference process is based on a formalization of the causal and probabilistic associative knowledge underlying diagnostic problem-solving [18]. We do not discuss the formulation, architecture, and activation mechanism of the connectionist model. This information can be found in [11, 13, 16, 18].

For TREC-2, we managed to complete only one official routing experiment for this approach, and it did not involve semantics. The experiment was intended to be a baseline experiment for our semantic experiments.

For TREC-2, a specific network was constructed for 50 topics. A list of index terms was assembled based on keywords in the concept section of each topic. In this network, each output node represented a topic, and each input node represented a keyword. The prior probability assigned to each topic node was equal to 1/(total number of topics). The connection strengths were assigned equal weights (0.9). The network contained 50 topic nodes and 848 index term nodes, connected via 1449 links. An example of this network is shown in Figure 5, where p_i is the prior probability of topic top_i. The keywords "army", "engineer", and "plant" were obtained by processing the concept section of the corresponding topic. Currently, the network is enhanced by using an estimated weighting scheme.

[Figure 5: A Sample Network of the Experimental Model.]

We performed a Category B routing experiment. Using just keywords, the results were not good. The main problem was that, in the document ranking, many documents had the same score used to generate the ranking. In order to satisfy the requirements for the ranking, we had to artificially rank those documents with the same score; this was done based on order of appearance. The performance was terrible except for Topic 66. This topic had only two known relevant documents for Category B routing experiments, and our inference network retrieved one of them in the top 20 documents!

No further connectionist model experiments have been completed. We were unable to modify the baseline keyword experiment or perform semantic experiments for this approach.

4. Vector Processing Model Experiments

In this section, we explain the manner in which semantics is incorporated within a vector processing model using the semantic lexicon explained in Section 2. Please note that an entry in our semantic lexicon has the form of a word followed by codes for each of the semantic categories the word triggers. We explain our approach using a text relevance determination procedure intended to show what is being calculated rather than the actual computations for the approach.
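The paper gives no code for the lexicon; as a rough illustration of the entry format just described (a word followed by codes for the semantic categories it triggers), the sketch below stores the lexicon as a simple mapping. The words and category codes shown are hypothetical, not entries from the actual lexicon of Section 2, and Python is used only for illustration.

    # Minimal sketch of a semantic lexicon, assuming each entry maps a word
    # to the codes of the semantic categories that word triggers.
    # All words and category codes here are made up for illustration.
    semantic_lexicon = {
        "depart": ["C41", "C53"],   # hypothetical category codes
        "plant":  ["C12", "C87"],
        "army":   ["C12"],
    }

    def categories(word):
        # Return the category codes a word triggers (empty if not in lexicon).
        return semantic_lexicon.get(word.lower(), [])

    print(categories("plant"))   # -> ['C12', 'C87']

In the experiments described next, a lookup of this kind is what associates a document or query word with semantic categories.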
The procedure presented here generates several outputs that are not strictly necessary, but they are included to help explain the approach. The relevance determination procedure is explained using the four documents and query shown in Figure 6.

A few preliminary computations are reviewed in order to explain the procedure. First, the number of documents each word is in must be determined. Figure 7 shows a list of words from the four documents and the query of Figure 6, along with the number of documents each word is in (df). Next, the inverse document frequency (idf) of each word is determined by the equation log10(N/df), where N = 4, the total number of documents. Figure 8 provides the idf of each word.

Sometimes, the idf of a word is undefined. This can happen when a word does not occur in the documents but does occur in a query. For example, the words "depart", "do", and "when" do not appear in the four documents. Thus, the idf of these terms cannot be defined here. Later, we will see that an adjustment can be made for these undefined values.

Next, the category probability of each query word is determined. Figure 9 shows an alphabetized list of all the unique words from the query, the frequency of each word in the query, and the semantic categories each word triggers.
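To make these preliminary computations concrete, the sketch below (ours, not the authors' code) counts document frequencies over four stand-in documents, computes idf = log10(N/df), and leaves the idf of query-only words such as "depart", "do", and "when" undefined (None). The documents and query are invented placeholders for Figure 6, not the actual test data.

    import math
    from collections import Counter

    # Stand-ins for the four documents and the query of Figure 6.
    documents = [
        "the army engineer visited the plant",
        "the plant was closed by the army",
        "an engineer designed the plant",
        "the army left the area",
    ]
    query = ["when", "do", "army", "engineer", "depart", "plant"]

    N = len(documents)                           # N = 4, as in the example
    doc_words = [set(d.split()) for d in documents]

    # df: the number of documents each word is in (as in Figure 7).
    vocabulary = set(query).union(*doc_words)
    df = {w: sum(w in d for d in doc_words) for w in vocabulary}

    # idf = log10(N / df); undefined (None) when df = 0, i.e. for words
    # that occur only in the query (as in Figure 8).
    idf = {w: math.log10(N / df[w]) if df[w] > 0 else None for w in vocabulary}

    # Frequency of each unique query word, alphabetized (as in Figure 9);
    # the lexicon lookup sketched earlier would supply the triggered categories.
    query_tf = Counter(query)
    for w in sorted(query_tf):
        print(w, query_tf[w], idf[w])

With N = 4, a word appearing in exactly one document gets idf = log10(4) ≈ 0.60, while "depart", "do", and "when" remain undefined until the adjustment mentioned above is applied.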