NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, editor. National Institute of Standards and Technology.

Incorporating Semantics Within a Connectionist Model and a Vector Processing Model
R. Boyd and J. Driscoll
Queryid (Num):       47 of 50
Total number of documents over all queries
    Retrieved:    36610
    Relevant:      2064
    Rel_ret:        913
Interpolated Recall - Precision Averages:
    at 0.00       0.3514
    at 0.10       0.1968
    at 0.20       0.1367
    at 0.30       0.1082
    at 0.40       0.0894
    at 0.50       0.0752
    at 0.60       0.0276
    at 0.70       0.0105
    at 0.80       0.0062
    at 0.90       0.0013
    at 1.00       0.0007
Average precision (non-interpolated) over all rel docs
                  0.0746
Precision:
  At    5 docs:   0.1660
  At   10 docs:   0.1532
  At   15 docs:   0.1433
  At   20 docs:   0.1298
  At   30 docs:   0.1057
  At  100 docs:   0.0643
  At  200 docs:   0.0465
  At  500 docs:   0.0302
  At 1000 docs:   0.0194
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:        0.1035

Figure 16. Filtering Using Keywords.

Queryid (Num):       47 of 50
Total number of documents over all queries
    Retrieved:    36383
    Relevant:      2064
    Rel_ret:        956
Interpolated Recall - Precision Averages:
    at 0.00       0.3961
    at 0.10       0.2479
    at 0.20       0.1734
    at 0.30       0.1258
    at 0.40       0.1067
    at 0.50       0.0838
    at 0.60       0.0372
    at 0.70       0.0195
    at 0.80       0.0100
    at 0.90       0.0029
    at 1.00       0.0009
Average precision (non-interpolated) over all rel docs
                  0.0919
Precision:
  At    5 docs:   0.2426
  At   10 docs:   0.2149
  At   15 docs:   0.1801
  At   20 docs:   0.1574
  At   30 docs:   0.1383
  At  100 docs:   0.0745
  At  200 docs:   0.0522
  At  500 docs:   0.0320
  At 1000 docs:   0.0203
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:        0.1283

Figure 17. Filtering Using Semantic Categories.
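For reference, the interpolated figures above follow the standard 11-point recipe: the precision quoted at recall level r is the maximum precision achieved at any recall greater than or equal to r. A minimal sketch for a single query (not the official trec_eval implementation):

    def interpolated_precision(ranked_rel_flags, num_relevant):
        """ranked_rel_flags[i] is truthy if the (i+1)-th retrieved doc is relevant."""
        points = []  # (recall, precision) after each relevant doc is retrieved
        rel_so_far = 0
        for rank, is_rel in enumerate(ranked_rel_flags, start=1):
            if is_rel:
                rel_so_far += 1
                points.append((rel_so_far / num_relevant, rel_so_far / rank))
        # Interpolated precision at recall r = max precision at any recall >= r.
        return {r / 10: max((p for rec, p in points if rec >= r / 10), default=0.0)
                for r in range(11)}

    # Example: 3 relevant docs overall, retrieved at ranks 1, 3, and 8.
    print(interpolated_precision([1, 0, 1, 0, 0, 0, 0, 1, 0, 0], 3))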
If the word "trains" is in the Query and the word "leaves" is in the Document, and we look at the semantic category Motion with Reference to Direction (AMDR), then one of the vector product elements in the formula becomes:

    probability("trains" triggers AMDR) × probability("leaves" triggers AMDR)

where the probabilities are obtained from our semantic lexicon.
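To make this concrete, here is a minimal sketch of such a category-matching score, assuming a hypothetical lexicon mapping each word to the probabilities with which it triggers each category; the words, categories, and numbers below are illustrative, not values from our semantic lexicon.

    # Hypothetical trigger probabilities; not taken from the paper's lexicon.
    lexicon = {
        "trains": {"AMDR": 0.4, "VEHICLE": 0.6},
        "leaves": {"AMDR": 0.3, "PLANT": 0.7},
    }

    def semantic_similarity(query_words, doc_words):
        """Sum, over (query word, doc word, shared category) triples, of the
        product of the two trigger probabilities."""
        score = 0.0
        for q in query_words:
            for d in doc_words:
                q_cats = lexicon.get(q, {})
                d_cats = lexicon.get(d, {})
                for cat in q_cats.keys() & d_cats.keys():
                    # e.g. probability("trains" triggers AMDR)
                    #    x probability("leaves" triggers AMDR)
                    score += q_cats[cat] * d_cats[cat]
        return score

    print(semantic_similarity(["trains"], ["leaves"]))  # 0.4 * 0.3 = 0.12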
We plan to do more experiments incorporating the following improvements:
a. Modernize the semantic lexicon. Since our lexicon is based on the 1911 version of Roget's Thesaurus, many modern words are not present and the senses of recorded words are not accurate. We plan to correct this. For example, we could try to get permission to use the current version of Roget's Thesaurus.
b. Base similarity on paragraphs instead of whole documents. We have had success using as few as 36 categories in a paragraph environment. We also feel that relevance decisions are made by humans looking at roughly a paragraph of information. We plan to modify our code to use paragraphs as a basis for the similarity measure (see the first sketch after this list).
c. Experiment with the number of possible semantic categories and the probability assigned to a triggered category. The experiment behind the performance improvement shown in Figure 16 and Figure 17 uses a very fine-grained set of semantic categories and treats the triggered semantic categories for a word uniformly. We plan to experiment with fewer categories, and we plan to obtain a probability distribution for categories based on word usage (see the second sketch after this list).
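A minimal sketch of improvement (b), assuming a similarity function such as the category-matching score sketched earlier, and assuming paragraphs are delimited by blank lines; none of this is our actual code:

    def best_paragraph_score(query_words, document_text, similarity):
        """Score a document by its best-matching paragraph rather than by
        the document as a whole (improvement b)."""
        paragraphs = [p.split() for p in document_text.split("\n\n") if p.strip()]
        # Take the maximum paragraph score; an empty document scores 0.0.
        return max((similarity(query_words, p) for p in paragraphs), default=0.0)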
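And a minimal sketch of the probability-distribution part of improvement (c), assuming hypothetical per-word counts of how often each category sense occurs in some corpus; the estimation method here is an assumption, not one we have settled on:

    from collections import Counter

    def category_distribution(usage_counts: Counter) -> dict:
        """Normalize raw category usage counts for a word into trigger
        probabilities that sum to 1 (improvement c)."""
        total = sum(usage_counts.values())
        return {cat: n / total for cat, n in usage_counts.items()}

    print(category_distribution(Counter({"AMDR": 3, "PLANT": 9})))
    # {'AMDR': 0.25, 'PLANT': 0.75}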
Basically, we are trying to establish a statistically sound approach to using word sense information. Our intuition is that word sense information should improve retrieval performance. Furthermore, our approach to using word sense information has shown a significant performance improvement in a question/answer environment where paragraphs represent documents. We feel that other word sense approaches, such as query expansion or word sense disambiguation, may not be statistically sound, and that may be why successful experiments have not been reported.