SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- The University of Massachusetts TIPSTER Project
chapter
W. B. Croft
National Institute of Standards and Technology
Donna K. Harman
                       Average Precision (20 topics)
Query Type             5 docs         30 docs        200 docs
man                    .66            .65            .50
man+weights            .68 (+3.0%)    .63 (-3.1%)    .48 (-4.0%)
man+EMIM+weights       .71 (+7.6%)    .61 (-6.2%)    .49 (-2.0%)
man+LEMIM+weights      .68 (+3.0%)    .64 (-1.5%)    .50 (0%)

Table 2: TIPSTER Routing Results: weights are based on frequency in relevant documents,
EMIM is a global selection measure, LEMIM is a local (window-based) selection measure.
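
The percentages in parentheses in Table 2 appear to be changes relative to the manual ("man") baseline at the same cutoff. As a minimal sketch under that assumption (the function name relative_change is hypothetical and not part of the TIPSTER software), the calculation can be written as:

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline, rounded to one decimal."""
    return round((value - baseline) / baseline * 100, 1)


# Precision values copied from Table 2, ordered as (5 docs, 30 docs, 200 docs).
baseline = (0.66, 0.65, 0.50)  # man
weighted = (0.68, 0.63, 0.48)  # man+weights

changes = [relative_change(b, v) for b, v in zip(baseline, weighted)]
print(changes)
```

For the man+weights row this gives changes of +3.0%, -3.1% and -4.0%, which match the table entries.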
feedback techniques until experiments with larger sets of relevance judgements are carried
out.
The third set of results is related to the retrieval of Japanese text. The goal of these
experiments was to compare different approaches to morphological analysis or word
segmentation. Japanese text is made up of characters from a number of alphabets (Kanji,
Katakana, Hiragana, and English). There are, however, no word separators and therefore
a major part of indexing is deciding what to index. We tested two alternatives:
1. An efficient, relatively crude technique where individual Kanji (Chinese) characters
and strings of Katakana characters are indexed.
2. A more sophisticated dictionary and grammar-based segmentation algorithm devel-
oped at Kyoto University (JUMAN).
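
The first alternative can be illustrated with a short sketch. It is not the code used in these experiments: the Unicode ranges, the function names char_index_terms and build_index, and the decision to skip Hiragana and all other characters are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumed Unicode ranges; the original system's character encoding is not
# described in the text.
KANJI = r"[\u4e00-\u9fff]"          # a single CJK ideograph
KATAKANA_RUN = r"[\u30a0-\u30ff]+"  # a maximal run of Katakana characters
TOKEN = re.compile(f"{KATAKANA_RUN}|{KANJI}")


def char_index_terms(text):
    """Return index terms: each Kanji character alone, each Katakana run whole."""
    return TOKEN.findall(text)


def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in char_index_terms(text):
            index[term].add(doc_id)
    return index


docs = {1: "東京のホテル", 2: "京都"}
index = build_index(docs)
print(char_index_terms(docs[1]))
```

Because no dictionary or grammar is consulted, a single pass over the text is enough, which is consistent with the much shorter indexing time reported for the character-based technique.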
There is a significant difference in the indexing times required by these techniques. With
a database of 1,100 documents from a Japanese newspaper, the character-based indexing
took 4 minutes while the word-based (JUMAN) indexing took 31 minutes. The relative
effectiveness of the two text representations was then tested using the average precision in
the top 10 documents for 30 queries. The queries were either treated as strings of characters,
or were automatically structured using the JUMAN segmenter. In the character-based
approach, words found in the query were expressed using the phrase operator to combine Kanji
and Katakana characters. The results show that the retrieval performance using Japanese
seems to be comparable to similar experiments with English databases, and the relatively
simple character-based indexing technique is surprisingly effective compared to more
sophisticated word-based techniques. The latter result is interesting, but the experiment must be
repeated when the larger TIPSTER Japanese database and query set become available.
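
The effectiveness measure used here, average precision in the top 10 documents over a set of queries, can be sketched as follows. This assumes the measure is precision at a cutoff of 10, averaged over queries; the function names and the toy rankings are hypothetical and are not data from the experiments.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=10):
    """Average precision_at_k over (ranking, relevant set) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy data: two queries, each with a ranking of document ids 1..10.
runs = [
    (list(range(1, 11)), {1, 4, 7}),         # 3 relevant in the top 10
    (list(range(1, 11)), {2, 3, 5, 8, 10}),  # 5 relevant in the top 10
]
score = mean_precision_at_k(runs)
print(score)
```

In the experiment described above, the list would hold one entry for each of the 30 queries, once for each text representation being compared.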
We are currently carrying out a range of more detailed experiments using the relevance
judgements that are now available. The results from these experiments will allow us to
tune the techniques being used and to make more definite conclusions about their relative
effectiveness. In addition, we will continue to incorporate new approaches into the retrieval
and routing software for the upcoming evaluations.