SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) TIPSTER Panel -- The University of Massachusetts TIPSTER Project chapter W. B. Croft National Institute of Standards and Technology Donna K. Harman

Query Type            Average Precision (20 topics)
                      5 docs         30 docs        200 docs
man                   .66            .65            .50
man+weights           .68 (+3.0%)    .63 (-3.1%)    .48 (-4.0%)
man+EMIM+weights      .71 (+7.6%)    .61 (-6.2%)    .49 (-2.0%)
man+LEMIM+weights     .68 (+3.0%)    .64 (-1.5%)    .50 (0%)

Table 2: TIPSTER Routing Results: weights are based on frequency in relevant documents, EMIM is a global selection measure, LEMIM is a local (window-based) selection measure.

feedback techniques until experiments with larger sets of relevance judgements are carried out.

The third set of results is related to the retrieval of Japanese text. The goal of these experiments was to compare different approaches to morphological analysis or word segmentation. Japanese text is made up of characters from a number of alphabets (Kanji, Katakana, Hiragana, and English). There are, however, no word separators, and therefore a major part of indexing is deciding what to index. We tested two alternatives:

1. An efficient, relatively crude technique where individual Kanji (Chinese) characters and strings of Katakana characters are indexed.

2. A more sophisticated dictionary and grammar-based segmentation algorithm developed at Kyoto University (JUMAN).

There is a significant difference in the indexing times required by these techniques. With a database of 1,100 documents from a Japanese newspaper, the character-based indexing took 4 minutes while the word-based (JUMAN) indexing took 31 minutes.

The relative effectiveness of the two text representations was then tested using the average precision in the top 10 documents for 30 queries.
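The evaluation measure just described, precision in the top 10 documents averaged over the query set, can be sketched as follows. This is a minimal illustration and not the evaluation code used in the experiments: the function names, the document and query identifiers, and the choice to divide by k even when fewer than k documents are retrieved are all assumptions made for the example.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are judged relevant.

    Assumption: the denominator is always k, so a ranking shorter than k
    is penalised for the missing positions.
    """
    top = ranked[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / k


def mean_precision_at_k(
    runs: Dict[str, List[str]],
    judgements: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average precision-at-k over all queries in `runs`.

    `runs` maps a query id to its ranked list of document ids;
    `judgements` maps a query id to its set of relevant document ids.
    Queries with no judgements are scored as having no relevant documents.
    """
    scores = [
        precision_at_k(ranked, judgements.get(query_id, set()), k)
        for query_id, ranked in runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

With 30 queries, `runs` would hold 30 ranked lists for one text representation, and the two representations would be compared by calling `mean_precision_at_k` once per representation with `k=10`.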
The queries were either treated as strings of characters, or were automatically structured using the JUMAN segmenter. In the character-based approach, words found in the query were expressed using the phrase operator to combine Kanji and Katakana characters.

The results show that the retrieval performance using Japanese seems to be comparable to similar experiments with English databases, and the relatively simple character-based indexing technique is surprisingly effective compared to more sophisticated word-based techniques. The latter result is interesting, but the experiment must be repeated when the larger TIPSTER Japanese database and query set become available.

We are currently carrying out a range of more detailed experiments using the relevance judgements that are now available. The results from these experiments will allow us to tune the techniques being used and to make more definite conclusions about their relative effectiveness. In addition, we will continue to incorporate new approaches into the retrieval and routing software for the upcoming evaluations.
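As an illustration of the character-based alternative discussed above, the following sketch extracts indexing units from a string: each Kanji character becomes its own term and each unbroken run of Katakana becomes one term. It is a minimal sketch and not the project's indexing code. The function names are hypothetical, the Unicode block ranges stand in for whatever character encoding the original system used, and skipping Hiragana, Latin letters, and punctuation is an assumption, since the text only states which characters are indexed.

```python
from typing import List


def is_kanji(ch: str) -> bool:
    # Assumption: Kanji are approximated by the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def is_katakana(ch: str) -> bool:
    # Katakana block; includes the prolonged sound mark and the middle dot.
    return "\u30a0" <= ch <= "\u30ff"


def character_index_terms(text: str) -> List[str]:
    """Return single Kanji characters and whole Katakana runs, in text order."""
    terms: List[str] = []
    katakana_run: List[str] = []
    for ch in text:
        if is_katakana(ch):
            katakana_run.append(ch)
            continue
        # Any non-Katakana character ends the current Katakana run.
        if katakana_run:
            terms.append("".join(katakana_run))
            katakana_run = []
        if is_kanji(ch):
            terms.append(ch)
        # Everything else (Hiragana, Latin, punctuation) is skipped here.
    if katakana_run:
        terms.append("".join(katakana_run))
    return terms


if __name__ == "__main__":
    print(character_index_terms("東京のホテル"))
```

No dictionary or grammar is consulted, which is consistent with the much shorter indexing time reported for the character-based approach. A query word such as 東京 would yield two single-character terms, and these would then need to be recombined at query time, for example with a phrase operator as the text describes.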