some obvious non-content phrases (such as `author describ', `paper attempt', `two way'). The result is 13,787 pairs. These, plus the previously manually prepared list (which consists mostly of phrases with at least one stopword, as well as some phrases identified in the `topics'), are then treated as if they were single index terms to be identified during document and query processing. They have their own global and local usage statistics, and can improve individual collection retrieval effectiveness from a few percent to 10%.

After documents are processed, we invoke Zipf's law to remove low-frequency (<= 4) and high-frequency (>= 16000) words from being used for representation. Low-frequency terms lead to too many nodes, and high-frequency terms lead to too many edges, both consuming valuable memory space. Unfortunately, our high-frequency cut-off of 16000 was a bad choice. In fact, we used a high cut-off of 12000 for routing because at that earlier period we were short of resources. The effect is that our queries become too short and many useful terms (such as `platform' in query #80, `crimin' for criminal in query #87, etc.) are screened out. We discover that this is a major factor in our disappointing results. Later experiments use a high cut-off of 50000.

4.2 Subdocuments

As in TREC-1, we segment documents into subdocument units to deal with the problems of WSJ documents having multiple unrelated stories, and of long documents in general. A really long document is FR89119-0111, which has 400,748 words. Our criterion is to break documents either on story boundaries or on the next paragraph boundary after a run of 360 words, for all collections. We have not found convenient story boundary indicators in the other collections as in WSJ (which uses three dashes `---' on a line). With this scheme, the total number of subdocuments from Disks 1 & 2 becomes 1,281,233, compared with an original 742,611.

After the TREC-2 deadline, we have the resources to investigate the effects of subdocuments on retrieval. Experiments were performed on the individual subcollections WSJ1, FR2, FR1 and AP2, using segmentation sizes of 360, 720 and 1080 words. For WSJ1, we further break on story boundaries only. Results are tabulated in Table 1. It appears that for the abnormally long documents of FR, breaking into subdocuments is definitely worthwhile, achieving improvements of over 20% compared with no segmentation. However, for the newswire documents of WSJ1 and AP2, subdocuments have marginal performance effects: a little better for large chunk sizes, and a little worse for small chunks. It seems that a chunk size of about 720-1000 words would get the benefits of both types. Using different chunk sizes for different collections would probably not be worth the effort.

                As Is    -------- Break on --------
                         1080     720      360      "Stories"
 WSJ1  Avg P    0.421    0.432    0.428    0.424    0.418
       # docs   98733    127151   134819   149611   193881
 FR1   Avg P    0.289    0.351    0.354    0.354
       # docs   26207    50055    64650    108374
 FR2   Avg P    0.333    0.372    0.420    0.421
       # docs   20108    --       51607    86787
 AP2   Avg P    0.423    0.423    0.404    0.414
       # docs   84678    85616    95867    146354

 Table 1: Document Segmentation, Avg Precision using 50 Queries Q2
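As a rough illustration of the segmentation rule described above (not the authors' actual implementation), the following Python sketch breaks a document into subdocument units at WSJ-style story separators (a line of three dashes) or at the first paragraph boundary after a run of 360 words. The function name, the blank-line paragraph convention, and the default parameters are assumptions for illustration only.

```python
def segment_document(text, chunk_words=360, story_sep="---"):
    """Split a document into subdocument units.

    A new unit starts at a story separator line (as in WSJ), or at the
    first paragraph boundary reached after `chunk_words` words have
    accumulated.  Paragraphs are assumed to be separated by blank lines.
    """
    subdocs, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        if paragraph.strip() == story_sep:
            # Story boundary: always close the current unit here.
            if current:
                subdocs.append("\n\n".join(current))
                current, count = [], 0
            continue
        current.append(paragraph)
        count += len(paragraph.split())
        if count >= chunk_words:
            # Word-run threshold passed: close at this paragraph boundary.
            subdocs.append("\n\n".join(current))
            current, count = [], 0
    if current:
        subdocs.append("\n\n".join(current))
    return subdocs
```

Varying `chunk_words` (e.g. 360, 720, 1080) reproduces the kind of segmentation-size comparison reported in Table 1.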
We would like to point out that, other than effectiveness, considerations such as isolating relevant sections for output display and for more precise feedback judgment would also make document segmentation worthwhile. In particular, a number of long documents in feedback for query expansion would easily overload memory space in our network.

4.3 Queries

Topics are preprocessed to remove introductory phrases and