NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology (ed. Donna K. Harman)
`I cite', `I report', ..), etc. If a capital `NOT' is found in a sentence, the rest of the sentence including the
`NOT' is also removed, because it is difficult for list queries to handle negation. We understand that this
does not completely solve the negation problem, because some `not's are not capitalized and many
negations are expressed by other means. The remaining words from the four paragraphs are then merged
and processed against the collection dictionary to form a query representation. No breakup into sub-
documents is done for topics.
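The negation heuristic above can be sketched as follows. This is an illustrative reconstruction, not the actual PIRCS code; the function name is hypothetical, and a bare substring match on `NOT' is used, which (as the text notes) is only a partial solution.

```python
import re

def strip_negated_clauses(text):
    """Drop the remainder of any sentence from a capitalized 'NOT' onward,
    keeping the words before it (hypothetical helper sketching the heuristic)."""
    # Rough sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    cleaned = []
    for s in sentences:
        idx = s.find('NOT')
        if idx != -1:
            s = s[:idx].rstrip()  # remove 'NOT' and everything after it
        if s:
            cleaned.append(s)
    return ' '.join(cleaned)
```

For example, on the input "Find reports on drug money. We do NOT want sports stories. Include takeovers." the second sentence is truncated to "We do", so the negated content never reaches the query.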
We also manually form a boolean query for each topic for soft-boolean retrieval, thus providing both an
alternative query representation and a different retrieval method. We essentially scan the same
paragraphs as before. Sometimes we also consult the document frequency of a term to screen out high
frequency terms, to arrive at a smaller expression. We might occasionally add to the boolean expression
some new terms that are not in the original paragraphs. However, the way our evaluation program works,
these new terms are ignored because they are not part of the automatically formed query.
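The paper does not specify the exact soft-boolean scoring function used; one common choice is the fuzzy-set interpretation, where AND takes the minimum and OR the maximum of the operand scores. The sketch below assumes that model and a hypothetical query-tree encoding.

```python
def soft_bool(node, weights):
    """Score a boolean query tree against per-term weights in [0, 1].
    Fuzzy-set interpretation: min for AND, max for OR (one possible
    soft-boolean model; the paper's actual function may differ).
    A node is either a term string or a tuple ('AND'|'OR', [children])."""
    if isinstance(node, str):
        return weights.get(node, 0.0)  # absent terms contribute zero
    op, children = node
    scores = [soft_bool(c, weights) for c in children]
    return min(scores) if op == 'AND' else max(scores)
```

For instance, the query ('AND', ['drug', ('OR', ['money', 'trafficking'])]) scored against weights {'drug': 0.8, 'money': 0.5} yields min(0.8, max(0.5, 0.0)) = 0.5, so a document matching only part of a conjunction is penalized but not excluded outright, unlike strict boolean retrieval.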
2.1. Initial Term Weighting based on Single Terms as Conceptual Components
After the previous processes, we apply the use of document components a second time. We regard each
content term within a sub-document or query as an independent concept. This allows us to use the
principle of document self-recovery to give initial weights to each term of an item, or to use the simpler
Inverse Collection Term Frequency (ICTF) weighting [1,2]. Because the former requires experimentally
adjusting some parameters and we did not have sufficient relevance judgment information, we decided
to use the simpler ICTF for our initial weighting of a term in an item (query or document) as follows:
w_ik = ln [p/(1-p)] + ln [(1-s_ik)/s_ik].    (1)
w_ik is the weight given to term k in item i; p = 1/50, a constant chosen based on previous experience; and
s_ik = (F_k - d_ik)/(N_W - L_i) if item i is a document, and s_ik = F_k/N_W if item i is a query. Here F_k = Sum_i d_ik is the
collection frequency of term k, d_ik is the term frequency of term k in item i, L_i = Sum_k d_ik is the length of
item i, and N_W = Sum_i L_i = Sum_k F_k is the total number of terms in the database. The fraction (1-s_ik)/s_ik inside
the logarithm ln in Eqn. 1 is approximately N_W/F_k if N_W >> F_k >> d_ik, hence the nomenclature ICTF.
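Eqn. 1 can be computed directly; the sketch below follows the definitions above (the function name and argument order are illustrative, not from the paper).

```python
import math

def ictf_weight(d_ik, F_k, L_i, N_W, is_document, p=1/50):
    """Initial term weight per Eqn. 1: w_ik = ln[p/(1-p)] + ln[(1-s_ik)/s_ik].
    d_ik: term frequency of term k in item i; F_k: collection frequency of k;
    L_i: length of item i; N_W: total number of terms in the database.
    For a document, s_ik excludes the item's own occurrences of the term;
    for a query, s_ik = F_k/N_W. p = 1/50 is the paper's chosen constant."""
    if is_document:
        s_ik = (F_k - d_ik) / (N_W - L_i)
    else:
        s_ik = F_k / N_W
    return math.log(p / (1 - p)) + math.log((1 - s_ik) / s_ik)
```

As a sanity check, for a query term with F_k = 100 in a collection of N_W = 1,000,000 terms, s_ik = 10^-4 and the second log term is ln(9999), which is close to the ICTF approximation ln(N_W/F_k) = ln(10000).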
2.2. Two-word Phrases and Other Vocabulary Control
Document frequencies (of terms) of a few hundred are small compared with a few hundred thousand
documents. Yet a few hundred high-ranked documents that are irrelevant would stretch a user's patience
to great limits. We therefore believe that in large collection environments, precision enhancement tools
are very important. Syntactic phrases, or statistically generated phrases within tight context, would
probably be useful as indexing terms. However, we do not yet have these tools for the experiments.
A look at the collection also shows that WSJ jargon contains many two-word phrases that are a
combination of very common words. Examples are `big three', `buy down', `drug money', `go public',
`take over', etc. By themselves, many of the single terms would be screened out because they are either
on the stopword list, or have high document frequencies. In combination, however, these two-word phrases
are precise in meaning and would probably impact favorably on both precision and recall. We therefore
spent a fair amount of manual effort to record these content-specific combinations in a two-word phrase
file. During processing, whenever such two-word phrases occur in adjacent positions in a sentence, a new
index term is created consisting of the combination, in addition to the single terms. Such two-word
combinations are also common in other fields. Our file has 396 such pairs, containing also some from
the computing field, and not necessarily just consisting of function or common words. We would like to
have a larger set, but we had neither the resources nor the domain expertise. Another vocabulary control