NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd
J. Driscoll
Considering Figure 12, the word "depart" occurs in the
query one time and triggers the category AMDR. The word
"leave" occurs in Document #4 once and also triggers the
category AMDR. Thus, item 1 in Figure 12 corresponds to
subsection (a) as described above. An example using sub-
section (b) occurs in item 14 of Figure 12.
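To make the Step 1 pairing concrete, here is a minimal Python sketch of one way the list of (First Entry, Second Entry, Third Entry) items could be built; the function and the triggers() lookup are hypothetical stand-ins for our lexicon access, not the actual implementation:

    # Hypothetical sketch of Step 1: a query word and a document word are
    # listed together when they match exactly or when they both trigger a
    # shared semantic category (e.g., "depart" and "leave" both trigger AMDR).
    def step1_pairs(query_words, doc_words, triggers):
        items = []
        for q in query_words:
            for d in doc_words:
                if q == d:
                    items.append((q, d, None))   # exact keyword match, no category
                for cat in triggers(q) & triggers(d):
                    items.append((q, d, cat))    # e.g., ("depart", "leave", "AMDR")
        return items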
Step 2.
This step adjusts for words in the query that are not in any
of the documents. Figure 13 shows the output of Step 2 for
Document #4. In this step, another list is created from the list
created in Step 1. For each item in the Step 1 list which has
a word with an undefined idf, this step replaces the word in the
First Entry column by the word in the Second Entry column.
For example, the word "depart" has an undefined idf as shown
in Figure 8. Thus, the word "depart" in item 1 of Figure 12
should be replaced by the word "leave" from the Second Entry
column. This is shown in item 1 of Figure 13. Likewise, the
words "do" and "when" also have an undefined idf and are
respectively replaced by the words from the Second Entry
column.
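As a minimal Python sketch of this substitution, assume idf is a mapping from words to idf values in which a missing key represents an undefined idf:

    # Sketch of Step 2: when the First Entry word has an undefined idf,
    # substitute the Second Entry word so a defined idf is available.
    def step2(items, idf):
        adjusted = []
        for first, second, cat in items:
            if first not in idf:
                first = second              # e.g., "depart" becomes "leave"
            adjusted.append((first, second, cat))
        return adjusted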
Step 3.
This step calculates the weight of a semantic component
in the query and calculates the weight of a semantic compo-
nent in the document. Figure 14 shows the output of Step 3
for Document #4. In Step 3, another list is created from the
list created in Step 2 as follows:
For each item in the Step 2 list, follow either subsection (a)
or (b), whichever applies:
a. If the Third Entry specifies a category, then
1) Replace the First Entry by computing:
(idf of word in First Entry) × (frequency of word in First Entry) × (probability the word triggers the category in the Third Entry)
2) Replace the Second Entry by computing:
(idf of word in Second Entry) × (frequency of word in Second Entry) × (probability the word triggers the category in the Third Entry)
3) Omit the Third Entry.
b. If the Third Entry does not specify a category, then
1) Replace the First Entry by computing:
(idf of word in First Entry) × (frequency of word in First Entry)
2) Replace the Second Entry by computing:
(idf of word in Second Entry) × (frequency of word in Second Entry)
3) Omit the Third Entry.
In Figure 14, item 1 is an example of using subsection (a),
and item 14 is an example of using subsection (b).
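The computation can be sketched in Python as follows; freq_q and freq_d are assumed frequency lookups for the query and the document, and p(word, cat) stands in for the probability that a word triggers a category:

    # Sketch of Step 3: each item becomes a pair of weights, one for the
    # query side (First Entry) and one for the document side (Second Entry);
    # the category (Third Entry) is folded into the weights and then omitted.
    # For a word substituted in Step 2, we assume freq_q still reflects the
    # original query word's occurrence count.
    def step3(items, idf, freq_q, freq_d, p):
        weights = []
        for first, second, cat in items:
            wq = idf[first] * freq_q[first]
            wd = idf[second] * freq_d[second]
            if cat is not None:             # subsection (a): category match
                wq *= p(first, cat)
                wd *= p(second, cat)
            weights.append((wq, wd))        # subsection (b) falls through as-is
        return weights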
Step 4.
This step multiplies the weights in the query by the weights
in the document. The top portion of Figure 15 shows the
output of Step 4. In the list created here, the numerical value
created in the First Entry column of Figure 14 is multiplied
by the numerical value created in the Second Entry column
of Figure 14.
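In a Python sketch consistent with the ones above, this step is an elementwise product over the Step 3 pairs:

    # Sketch of Step 4: multiply each query-side weight by its
    # document-side counterpart.
    def step4(weights):
        return [wq * wd for wq, wd in weights]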
Step 5.
This step sums the values in the Step 4 list to compute the
semantic similarity coefficient for a particular document. The
bottom portion of Figure 15 shows the output of Step 5 for
Document #4.
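Completing the sketch, Step 5 reduces the Step 4 list to a single number:

    # Sketch of Step 5: the semantic similarity coefficient for the
    # document is the sum of the Step 4 products.
    def step5(products):
        return sum(products)

Chaining the five sketches for a query and a document would then yield the coefficient shown in the bottom portion of Figure 15.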
We have finally observed an improved Precision/Recall
performance using the semantic similarity coefficient
explained here. For example, in a Category B filtering
experiment where the words being considered were only those
in the topics and idf values were determined by the number
of topics a word is in, we have observed the keyword and
semantic results shown in Figure 16 and Figure 17, respec-
tively. The 11-pt average for these two experiments reveals
a 23% increase due to the use of semantic categories.
According to Sparck Jones' criteria, this change would be
classified as "significant" (greater than 10.0%) [12]. We
believe further improvement is possible by considering more
words, stemming for plurals and tenses of words, better idf
values (like those used for archival retrieval), a modern
lexicon, and a focus on paragraphs instead of whole docu-
ments.
5. Summary
Our progress during TREC-1 and TREC-2 has been the
following:
a. We created efficient code for a UNIX platform. Originally
our code used B+ tree structures for implementing inverted
files on a DOS platform. We now use hashing to replace
B+ trees, establishing codes to replace character strings,
and the UNIX platform provides faster processing than the
DOS platform.
b. We built an index for a semantic lexicon based on the public
domain 1911 version of Roget's Thesaurus. To do this,
we had to create our own category numbering system
similar to today's version of Roget's Thesaurus.
c. We solved part of the blend problem for semantic and
keyword weights. We now base semantic category weights
on the idf of words which generate the semantic categories.
We can now index or scan TREC documents at rates faster
than 60 Megabytes per hour depending on the workstation.
We have a semantic lexicon of approximately 20,000 words
with flexible category codes that allow a coarse (36 catego-
ries) through fine (more than 15,000 categories) semantic
analysis. As shown in Section 4, our procedure for
determining relevance is based on the senses of each word.
For example, using the vector processing model and the
similarity coefficient

sim(Q, D_j) = \sum_{i=1}^{t} w_{q,i} \cdot w_{d_j,i},