SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Retrieval Experiments with a Large Collection using PIRCS chapter K. Kwok L. Papadopoulos K. Kwan National Institute of Standards and Technology Donna K. Harman

2.1. Use of Document Components and Query Formulation

We view each source document not as monolithic, but as constituted of components. Document components are used in two ways: to construct a more restricted context for term weighting, retrieval and feedback, and to define initial weights of the content terms for representation. These are discussed in the following sub-sections.

2.1.1 Sub-Documents as Document Components

A survey of the WSJ files shows that document lengths vary substantially, from a couple of lines to hundreds of lines containing several thousand words. Moreover, many documents carry unrelated news stories, separated by three dashes `---` on an independent line. We believe that treating such documents as monolithic objects will have an adverse impact on:

a) precision, because there is a high probability that homographs occur in a sense and context different from what one intends;
b) term weighting, because the estimates of the necessary probabilities for the index terms may be unreliable, which in turn affects retrieval;
c) feedback and query expansion, because documents that are long and mix unrelated topics make these processes imprecise; and
d) system output effectiveness, because after retrieval, users still have to work through a large document to locate the relevant passage.

There may be some risks in breaking up documents. A Boolean AND would not be satisfied if its two operands happen to fall in separate sub-documents. Coordinate matching would count fewer matching terms unless the counts of a document's sub-documents are somehow combined for ranking purposes, assuming all matched terms are topically relevant.
List queries with term weights may or may not suffer: even though a sub-document would have fewer term matches than a full document, shorter document lengths may lead to higher term weights, depending on the weighting method used. Since all documents are treated in the same fashion, the sub-document weights will be affected to the same degree. If we allow for a substantial chunk of text, such as a few hundred words, a writer generally has a chance to express what s/he intends fairly completely, either in this or in another sub-document. Since we are using neither Boolean retrieval nor coordinate matching, we believe the benefits of using sub-documents with more uniform lengths outweigh the risks. Our experimental results seem to support our conjecture.

Therefore the first processing step was to break each document into component units of approximately equal length. We create a new component whenever we recognize the story separation mark `---`, or when we have a run of text exceeding N words and ending on a paragraph boundary. We call each component a sub-document and identify it by attaching two more digits to the end of the original document ID, assuming a breakup of at most 100 sub-documents per document.

A sub-document should not be too short, lest it carry little content. Moreover, since each sub-document is an independent entry consuming space and time resources, we do not want the number of sub-documents to exceed a certain limit, which we arbitrarily set as double the original number of documents. After some experimentation, we found that a break at N=360 raw words satisfies our design goal. The original number of documents in WSJ is (first half plus second half) 98733+74486 = 173219; after breaking into components, we end up with 192935+156880 = 349815 sub-documents. This forms our database for subsequent processing. We would prefer to break documents based on more sophisticated strategies such as context, but we have not done so.
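The breakup rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PIRCS implementation: the function name `split_into_subdocuments`, the use of a blank line as the paragraph boundary, the numbering of sub-documents from `00`, and the document ID `WSJ001` are all hypothetical choices that the text does not specify.

```python
import re

# Story separator: three dashes alone on a line (as described in the text).
STORY_BREAK = re.compile(r"(?m)^[ \t]*---[ \t]*$")
# Assumption: paragraphs are separated by one or more blank lines.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def split_into_subdocuments(doc_id, text, n_words=360, max_parts=100):
    """Return a list of (sub_document_id, text) pairs for one document.

    A new sub-document starts at every story separator, or at the first
    paragraph boundary reached after the current run exceeds n_words.
    """
    chunks = []
    for story in STORY_BREAK.split(text):
        current, count = [], 0
        for paragraph in PARAGRAPH_BREAK.split(story.strip()):
            paragraph = paragraph.strip()
            if not paragraph:
                continue  # skip empty stories or stray blank runs
            current.append(paragraph)
            count += len(paragraph.split())  # raw whitespace-delimited words
            if count > n_words:
                # Run exceeds N words and we are on a paragraph boundary.
                chunks.append("\n\n".join(current))
                current, count = [], 0
        if current:
            # Remaining text before the next story separator or end of document.
            chunks.append("\n\n".join(current))

    if len(chunks) > max_parts:
        raise ValueError(f"{doc_id}: more than {max_parts} sub-documents")

    # Assumption: two-digit suffixes are numbered from 00.
    return [(f"{doc_id}{i:02d}", chunk) for i, chunk in enumerate(chunks)]


if __name__ == "__main__":
    sample = "a b c\n\nd e\n---\nf g h i\n\nj"
    for sub_id, sub_text in split_into_subdocuments("WSJ001", sample, n_words=4):
        print(sub_id, repr(sub_text))
```

Because a break is made only at a paragraph boundary, a sub-document can run somewhat past N words, and the final piece of a story can be much shorter; a production version might merge very short trailing pieces, which this sketch does not attempt.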
2.1.2 Query Formulation

In PIRCS without soft-boolean, queries and documents are items of the same category, each containing a list of content terms. For a query, we obtain the list from the <title>, <desc>, <narr> and <con> paragraphs of the topic in a fully automatic fashion. These paragraphs also go through a filtering program that removes standard introductory phrases such as: `To be relevant, a document (will | must | ..) (discuss