NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology (ed. Donna K. Harman)
`I cite', `I report', ..), etc. If a capital `NOT' is found in a sentence, the rest of the sentence including the
`NOT' is also removed, because it is difficult for list queries to handle negation. We understand that this
does not completely solve the negation problem, because some `not's are not capitalized and many
negations are expressed by other means. The remaining words from the four paragraphs are then merged
and processed against the collection dictionary to form a query representation. No breakup into sub-
documents is done for topics.
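The negation heuristic above can be sketched as follows. This is an illustrative reconstruction, not the actual PIRCS code; the function name is hypothetical, and a bare substring match on `NOT' is used, which (as the text notes) is only a partial solution.

```python
import re

def strip_negated_clauses(text):
    """Drop the remainder of any sentence from a capitalized 'NOT' onward,
    keeping the words before it (hypothetical helper sketching the heuristic)."""
    # Rough sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    cleaned = []
    for s in sentences:
        idx = s.find('NOT')
        if idx != -1:
            s = s[:idx].rstrip()  # remove 'NOT' and everything after it
        if s:
            cleaned.append(s)
    return ' '.join(cleaned)
```

For example, on the input "Find reports on drug money. We do NOT want sports stories. Include takeovers." the second sentence is truncated to "We do", so the negated content never reaches the query.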
We also manually form a boolean query for each topic for soft-boolean retrieval, thus providing both an
alternative query representation and a different retrieval method. We essentially scan the same
paragraphs as before. Sometimes we also consult the document frequency of a term to screen out high
frequency terms, to arrive at a smaller expression. We might occasionally add to the boolean expression
some new terms that are not in the original paragraphs. However, the way our evaluation program works,
these new terms are ignored because they are not part of the automatically formed query.
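The paper does not specify the exact soft-boolean scoring function used; one common choice is the fuzzy-set interpretation, where AND takes the minimum and OR the maximum of the operand scores. The sketch below assumes that model and a hypothetical query-tree encoding.

```python
def soft_bool(node, weights):
    """Score a boolean query tree against per-term weights in [0, 1].
    Fuzzy-set interpretation: min for AND, max for OR (one possible
    soft-boolean model; the paper's actual function may differ).
    A node is either a term string or a tuple ('AND'|'OR', [children])."""
    if isinstance(node, str):
        return weights.get(node, 0.0)  # absent terms contribute zero
    op, children = node
    scores = [soft_bool(c, weights) for c in children]
    return min(scores) if op == 'AND' else max(scores)
```

For instance, the query ('AND', ['drug', ('OR', ['money', 'trafficking'])]) scored against weights {'drug': 0.8, 'money': 0.5} yields min(0.8, max(0.5, 0.0)) = 0.5, so a document matching only part of a conjunction is penalized but not excluded outright, unlike strict boolean retrieval.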
2.1. Initial Term Weighting based on Single Terms as Conceptual Components
After the previous processes, we apply the use of document components a second time. We regard each
content term within a sub-document or query as an independent concept. This allows us to use the
principle of document self-recovery to give initial weights to each term of an item, or to use the simpler
Inverse Collection Term Frequency (ICTF) weighting [1,2]. Because the former requires experimentally
adjusting some parameters and we did not have sufficient relevance judgment information, we decided
to use the simpler ICTF for our initial weighting of a term in an item (query or document) as follows:
w_ik = ln [p/(1-p)] + ln [(1-s_ik)/s_ik].    (1)
w_ik is the weight given to term k in item i; p = 1/50, a constant chosen based on previous experience; and
s_ik = (F_k - d_ik)/(N_W - L_i) if item i is a document, and s_ik = F_k/N_W if item i is a query. Here F_k = Sum_i d_ik is the
collection frequency of term k, d_ik is the term frequency of term k in item i, L_i = Sum_k d_ik is the length of
item i, and N_W = Sum_i L_i = Sum_k F_k is the total number of terms in the database. The fraction (1-s_ik)/s_ik inside
the logarithm ln in Eqn. 1 is approximately N_W/F_k if N_W >> F_k >> d_ik, hence the nomenclature ICTF.
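Eqn. 1 can be computed directly; the sketch below follows the definitions above (the function name and argument order are illustrative, not from the paper).

```python
import math

def ictf_weight(d_ik, F_k, L_i, N_W, is_document, p=1/50):
    """Initial term weight per Eqn. 1: w_ik = ln[p/(1-p)] + ln[(1-s_ik)/s_ik].
    d_ik: term frequency of term k in item i; F_k: collection frequency of k;
    L_i: length of item i; N_W: total number of terms in the database.
    For a document, s_ik excludes the item's own occurrences of the term;
    for a query, s_ik = F_k/N_W. p = 1/50 is the paper's chosen constant."""
    if is_document:
        s_ik = (F_k - d_ik) / (N_W - L_i)
    else:
        s_ik = F_k / N_W
    return math.log(p / (1 - p)) + math.log((1 - s_ik) / s_ik)
```

As a sanity check, for a query term with F_k = 100 in a collection of N_W = 1,000,000 terms, s_ik = 10^-4 and the second log term is ln(9999), which is close to the ICTF approximation ln(N_W/F_k) = ln(10000).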
2.2. Two-word Phrases and Other Vocabulary Control
Document frequencies (of terms) of a few hundred are small compared with a few hundred thousand
documents. Yet a few hundred high-ranked documents that are irrelevant would stretch a user's patience
to great limits. We therefore believe that in large collection environments, precision enhancement tools
are very important. Syntactic phrases, or statistically generated phrases within tight context, would
probably be useful as indexing terms. However, we do not yet have these tools for the experiments.
A look at the collection also shows that WSJ jargon contains many two-word phrases that are a
combination of very common words. Examples are `big three', `buy down', `drug money', `go public',
`take over', etc. By themselves, many of the single terms would be screened out because they are either
on the stopword list, or have high document frequencies. In combination, however, these two-word phrases
are precise in meaning and would probably impact favorably on both precision and recall. We therefore
spent a fair amount of manual effort to record these content-specific combinations in a two-word phrase
file. During processing, whenever such two-word phrases occur in adjacent positions in a sentence, a new
index term is created consisting of the combination, in addition to the single terms. Such two-word
combinations are also common in other fields. Our file has 396 such pairs, containing also some from
the computing field, and not necessarily just consisting of function or common words. We would like to
have a larger set, but we had neither the resources nor the domain expertise. Another vocabulary control