NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-2 Document Retrieval Experiments using PIRCS
K. Kwok
L. Grunfeld
National Institute of Standards and Technology
D. K. Harman
some obvious non-content phrases (such as `author describ',
`paper attempt', `two way'). The result is 13,787 pairs.
These, plus the previously prepared manual list (which
consists mostly of phrases with at least one stopword, as
well as some phrases identified in the `topics'), are then
treated as if they were single index terms to be identified
during document and query processing [OCRerr]US[OCRerr]3J. They
have their own global and local usage statistics, and can
improve individual collection retrieval effectiveness from a
few percent to 10%.
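To make the matching step concrete, the fragment below is a minimal sketch in Python, not the PIRCS implementation, of how a pre-compiled list of two-word phrases can be recognized during tokenization and emitted as single index terms alongside the ordinary single-word terms; the stoplist and the phrase pairs shown are hypothetical stand-ins.

# Minimal sketch (not the PIRCS code): match adjacent word pairs against a
# pre-compiled phrase list and index each matched pair as one term.
STOPWORDS = {"the", "of", "a", "in", "for", "on"}            # assumed stoplist
PHRASES = {("joint", "venture"), ("united", "states")}       # assumed phrase pairs

def index_terms(text):
    """Yield single-word terms plus any adjacent pair found in PHRASES."""
    words = text.lower().split()
    for i, w in enumerate(words):
        if w not in STOPWORDS:
            yield w                                          # ordinary index term
        if i + 1 < len(words) and (w, words[i + 1]) in PHRASES:
            yield w + "_" + words[i + 1]                     # phrase indexed as one term

print(list(index_terms("joint venture in the united states")))
# -> ['joint', 'joint_venture', 'venture', 'united', 'united_states', 'states']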
After documents are processed, we invoke Zipf's law to
remove low (<=4) and high frequency (>=16000) words from
being used for representation. Low frequency terms lead to
too many nodes, and high frequency terms lead to too many
edges, both consuming valuable memory space.
Unfortunately, our high frequency cut-off of 16000 was a
bad choice. In fact, we used high = 12000 for routing
because at that earlier period we were short of resources.
The effect is that our queries become too short and many
useful terms (such as `platform' in query #80, `crimin' for
criminal in query #87, etc.) are screened out. We discover
that this is a major factor in our disappointing results. Later
experiments use a high cut-off of 50000.
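As an illustration of this screening, the fragment below is a rough sketch, assuming a simple in-memory term-count table rather than our network structures, of dropping terms whose collection occurrence counts fall at or below the low cut-off or at or above the high cut-off; the cut-off values follow the text, the rest is hypothetical.

from collections import Counter

LOW_CUTOFF = 4        # terms occurring <= 4 times are screened out
HIGH_CUTOFF = 50000   # later experiments raised the high cut-off to 50000

def usable_vocabulary(documents):
    """Return the set of terms kept for representation."""
    freq = Counter()
    for tokens in documents:          # documents: iterable of token lists
        freq.update(tokens)
    # keep only terms strictly between the two cut-offs
    return {t for t, n in freq.items() if LOW_CUTOFF < n < HIGH_CUTOFF}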
4.2 Subdocuments
As in TREC-1, we segment documents into subdocument
units to deal with the problems of WSJ documents having
multiple unrelated stories, and long documents in general.
A really long document is FR89119-0111, which has 400,748
words. Our criterion is to break documents either on story
boundaries or on the next paragraph boundary after a run of
360 words, for all collections. We have not found
convenient story boundary indicators in other collections as
in WSJ (which uses three dashes `---' on a line). With this
scheme, the total number of subdocuments from Disks 1 & 2
becomes 1,281,233, compared with an original 742,611.
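The sketch below, assuming newline-delimited paragraphs and the WSJ-style `---' line as a story separator (an illustration only, not our actual segmenter), shows one way to realize this rule: cut at a story boundary, or at the first paragraph boundary reached after a run of 360 words.

MAX_RUN = 360   # word run length before breaking at the next paragraph boundary

def subdocuments(text):
    """Split one document into a list of subdocument strings."""
    units, current, run = [], [], 0
    for para in text.split("\n\n"):              # assumed paragraph delimiter
        if para.strip() == "---":                # WSJ-style story boundary
            if current:
                units.append("\n\n".join(current))
            current, run = [], 0
            continue
        current.append(para)
        run += len(para.split())
        if run >= MAX_RUN:                       # run exceeded: break here
            units.append("\n\n".join(current))
            current, run = [], 0
    if current:
        units.append("\n\n".join(current))
    return units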
After the TREC-2 deadline, we have the resources to
investigate the effects of subdocuments on retrieval.
Experiments were performed on the individual subcollections
WSJ1, FR2, FR1 and AP2, using segmentation sizes of 360,
720 and 1080 words. For WSJ1, we further break on story
boundaries only. Results are tabulated in Table 1. It
appears that for the abnormally long documents of FR,
breaking into subdocuments is definitely worthwhile,
achieving improvements of over 20% compared with no
segmentation.
             AsIs     Break on    1080      720       360
                      "Stories"
WSJ1
  Avg Prec   0.421    0.432       0.428     0.424     0.418
  # of docs  98733    127151      134819    149611    193881
FR1
  Avg Prec   0.289                0.351     0.354     0.354
  # of docs  26207                50055     64650     108374
FR2
  Avg Prec   0.333                0.372     0.420     0.421
  # of docs  20108                          51607     86787
AP2
  Avg Prec   0.423                0.423     0.404     0.414
  # of docs  84678                85616     95867     146354

Table 1: Document Segmentation Average Precision using 50 Queries Q2
However, for the newswire documents of WSJ1 and AP2,
subdocuments have marginal performance effects: a little
better for large chunk sizes, and a little worse for small
chunks. It seems that a chunk size of about 720-1000 words
would get the benefits of both types. Using different chunk
sizes for different collections would probably not be worth
the effort.
We would like to point out that, other than effectiveness,
considerations such as isolating relevant sections for output
display and for more precise feedback judgment would also
make document segmentation worthwhile. In particular, a
number of long documents in feedback for query expansion
would easily overload memory space in our network.
4.3 Queries
Topics are preprocessed to remove introductory phrases and