SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
chapter
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
2.1. Use of Document Components and Query Formulation
We view each source document not as monolithic, but as constituted of components. Document
components are utilized in two ways: for constructing a more restricted context for term weighting,
retrieval and feedback, and for defining initial weights to the content terms for representation. These are
discussed in the following sub-sections.
2.1.1 Sub-Documents as Document Components
A survey of the WSJ files shows that document lengths vary substantially, from a couple of lines to
hundreds of lines containing several thousand words. Moreover, many documents carry unrelated news stories,
separated by three dashes `---` on an independent line. We believe that treating such documents as
monolithic objects would have an adverse impact on: a) precision, because they raise the probability
that homographs occur in a sense and context different from the one intended; b) term weighting,
because the probability estimates needed for the index terms may be unreliable, which in turn affects
retrieval; c) feedback and query expansion, because documents that are long and mix unrelated
topics make these processes imprecise; and d) system output effectiveness, because after retrieval,
users must still search through a large document to locate the relevant passage.
There may be some risks in breaking up documents. A Boolean AND would not be satisfied if its two
operands happen to be split across separate sub-documents. Coordinate matching would count fewer
matching terms unless the counts from a document's sub-documents are somehow combined for ranking purposes,
assuming all matching terms are topically relevant. List queries with term weights may or may not suffer:
even though a sub-document would have fewer term matches than a full document, shorter document lengths
may lead to higher term weights, depending on the weighting method used. Since all documents are
treated in the same fashion, the sub-document weights will be affected to the same degree. If we allow
for a substantial chunk of text, such as a few hundred words, a writer generally would have a chance to
express what s/he intends fairly completely, either in this or in another sub-document. Since we are
using neither Boolean retrieval nor coordinate matching, we believe the benefits of using sub-documents
with more uniform lengths outweigh the risks. Our experimental results seem to support our conjecture.
Therefore our first processing step was to break each document into component units of approximately
equal length. We create a new component, which we call a sub-document, whenever we recognize the story
separation mark `---`, or when we have a run of text exceeding N words and ending on a paragraph
boundary. Each sub-document ID is the original document ID with two more digits attached at the end,
assuming a breakup of at most 100 sub-documents per document. A sub-document should not be too short,
lest it carry little content. Moreover, since each sub-document is an independent entry consuming space
and time resources, we do not want the number of sub-documents to exceed a certain limit, which we
arbitrarily set at double the original number of documents. After some experimentation, we found that a
break at N=360 raw words satisfies our design goal.
The original number of documents in WSJ is (first half plus second half) 98733+74486 = 173219; after
breaking into components, we end up with 192935+156880 = 349815 sub-documents. This forms our
database for subsequent processing. We would prefer to break documents based on more sophisticated
strategies such as context, but we have not done so.
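The breakup rule described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the PIRCS implementation: the function name
`split_into_subdocuments`, the blank-line paragraph convention, whitespace word counting, and
suffix numbering starting at 00 are all assumptions made for the example. The sketch also does not
attempt to merge sub-documents that turn out too short.

```python
def split_into_subdocuments(doc_id, text, max_words=360):
    """Split one document into (sub_id, sub_text) pairs.

    Hypothetical sketch: a new sub-document starts at a `---` line, or at
    the first paragraph boundary after the running text exceeds max_words.
    """
    pieces = []       # finished sub-document texts
    paragraphs = []   # paragraphs of the sub-document being built
    words = 0         # raw word count of `paragraphs`
    buffer = []       # lines of the paragraph currently being read

    # Assumption: blank lines separate paragraphs. A trailing sentinel
    # blank line closes the final paragraph.
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        is_separator = stripped == "---"
        if stripped and not is_separator:
            buffer.append(stripped)
            continue
        # We are at a paragraph boundary (blank line or story separator).
        if buffer:
            paragraph = " ".join(buffer)
            paragraphs.append(paragraph)
            words += len(paragraph.split())
            buffer = []
        if paragraphs and (is_separator or words > max_words):
            pieces.append("\n\n".join(paragraphs))
            paragraphs, words = [], 0

    if paragraphs:
        pieces.append("\n\n".join(paragraphs))
    if len(pieces) > 100:
        raise ValueError("two-digit suffix allows at most 100 sub-documents")
    # Attach two digits to the original document ID.
    return [(f"{doc_id}{i:02d}", piece) for i, piece in enumerate(pieces)]
```

For example, calling `split_into_subdocuments("DOC1", text)` on a document holding two stories
separated by `---` would yield sub-documents with the hypothetical IDs `DOC100` and `DOC101`,
provided neither story exceeds the word limit.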
2.1.2 Query Formulation
In PIRCS without soft-boolean, queries and documents are items of the same category, each containing
a list of content terms. For a query, we obtain the list from the <title>, <desc>, <narr> and <con>
paragraphs of the topic in a fully automatic fashion. These paragraphs also go through a filtering program
that removes standard introductory phrases such as: `To be relevant, a document (will | must | ..) (discuss