NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
tool is a stopword list with 595 entries, some of which are morphological variants of the same word. In
addition, we use Porter's stemming algorithm [3] for suffix stripping.
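The two vocabulary-control steps described above can be illustrated as follows. This is a toy sketch: the stopword list and the suffix rules here are small stand-ins, not the 595-entry list or Porter's full algorithm.

```python
# Toy illustration of stopword removal followed by suffix stripping.
# STOPWORDS and the suffix rules are illustrative placeholders only.
STOPWORDS = {"the", "a", "of", "is", "are", "and", "in", "to"}

def strip_suffix(word):
    """Naive suffix stripping; Porter's algorithm has many more rules."""
    for suffix in ("ational", "ization", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, drop stopwords, then strip suffixes."""
    tokens = text.lower().split()
    return [strip_suffix(t) for t in tokens if t not in STOPWORDS]

print(normalize("the retrieval experiments are running"))
```

Note that naive stripping over-conflates (e.g. "running" becomes "runn"); Porter's rule conditions exist precisely to limit such errors.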
There are many other vocabulary control tools, such as spell-checkers, proper-noun and date identification, synonym lists, thesauri, and semantic nets. These could be added in the future.
2.3. Multiple Retrieval Methods
It is well known that different retrieval algorithms applied to the same collection lead to different retrieval results, and that combining them can substantially enhance effectiveness [2,4,5,6]. Our network approach to IR (Section 4) can support multiple retrieval methods fairly easily. Three retrieval methods are used in PIRCS for these experiments, and they align well with the different types of TREC query requirements:
(a) Query-Focused Retrieval
This method addresses the question: given a query q_a, should document d_i be considered relevant to it? This point of view is appropriate for the ad hoc environment, where a set of static documents is processed against a query stream. Each query serves as a focus against which documents are ranked. An approximate probabilistic measure W_ia of this answer is evaluated for each d_i as follows, and is used as the RSV (retrieval status value) to rank the whole collection with respect to q_a:

    W_ia = Sum_k w_ak * d_ik / L_i.     (2)

Here w_ak is the weight of term k in query q_a as given in Eqn. 1, and d_ik/L_i measures the proportion that term k is used in d_i. The sum is over all terms k that overlap between d_i and q_a.
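The ranking of Eqn. 2 can be sketched as below. The query weights and document term counts are illustrative placeholders, not the Eqn. 1 values.

```python
# Minimal sketch of the query-focused RSV of Eqn. 2:
#   W_ia = sum over overlapping terms k of w_ak * d_ik / L_i,
# where w_ak is the query-side term weight, d_ik the count of term k
# in document d_i, and L_i the document length.

def query_focused_rsv(query_weights, doc_counts):
    """Score one document against a query (Eqn. 2 sketch)."""
    L_i = sum(doc_counts.values())  # document length
    return sum(
        w_ak * doc_counts[k] / L_i
        for k, w_ak in query_weights.items()
        if k in doc_counts  # only overlapping terms contribute
    )

# Illustrative data: a query and a small static document collection.
query = {"retrieval": 2.0, "network": 1.5}
docs = {
    "d1": {"retrieval": 3, "network": 1, "text": 4},
    "d2": {"text": 5, "conference": 2},
}
# Rank the whole collection with respect to the query.
ranking = sorted(docs, key=lambda d: query_focused_rsv(query, docs[d]), reverse=True)
print(ranking)
```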
(b) Document-Focused Retrieval
This method addresses the reverse question: given a document d_i, should query q_a be considered relevant to it? This point of view is appropriate for routing, where a set of static queries is processed against a document stream. Each document serves as a focus against which queries can be ranked. However, evaluation of retrieval is usually done with respect to a specific query q_a, and so an approximate probabilistic measure V_ia of this answer is given to each document d_i and used as its RSV:

    V_ia = Sum_k w_ik * q_ak / L_a.     (3)
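Eqn. 3 mirrors Eqn. 2 with the roles of query and document swapped, as the sketch below illustrates. The document-side weights w_ik and query term counts are again illustrative placeholders.

```python
# Sketch of the document-focused RSV of Eqn. 3:
#   V_ia = sum over overlapping terms k of w_ik * q_ak / L_a,
# where w_ik is the document-side term weight, q_ak the count of
# term k in query q_a, and L_a the query length.

def doc_focused_rsv(doc_weights, query_counts):
    """Score one query against a document (Eqn. 3 sketch)."""
    L_a = sum(query_counts.values())  # query length
    return sum(
        w_ik * query_counts[k] / L_a
        for k, w_ik in doc_weights.items()
        if k in query_counts  # only overlapping terms contribute
    )

# Illustrative routing setup: one incoming document, static queries.
doc = {"retrieval": 1.2, "network": 0.8}
queries = {
    "q1": {"retrieval": 2, "experiments": 1},
    "q2": {"conference": 3},
}
for q, counts in queries.items():
    print(q, doc_focused_rsv(doc, counts))
```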
(c) Soft-Boolean Retrieval (Query-Focused)
This method depends on a Boolean query being available for each topic, which we create manually for this experiment. The idea is that topics represented as a list of terms (as in (a) and (b) above) have no structure. Boolean queries have structure, but their logic is `hard' and we lose the ability to attach term importance or to rank the collection. The soft-Boolean approach retains the query structure yet attaches weights to both the terms and the Boolean operators, so that we can soften the logic and provide ranking capability. We have followed the extended Boolean approach of the vector model [7] for this implementation, but with simplified weights. For example, a query q_a may be of the form:

    q_a = B_b AND_p X,  with  X = C_c OR_p D_d.     (4)
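A query of the shape in Eqn. 4 can be evaluated with the p-norm operators of the extended Boolean model [7], as sketched below. We assume here that b, c, d are term weights, that p softens the operators, and that each document supplies per-term scores in [0, 1]; all numeric values are illustrative.

```python
# p-norm soft-Boolean evaluation (extended Boolean model of [7]).
# pairs: list of (term_weight, doc_score) with doc_score in [0, 1].

def or_p(pairs, p):
    """p-norm OR: weighted distance from the all-zeros point."""
    num = sum((w ** p) * (s ** p) for w, s in pairs)
    den = sum(w ** p for w, _ in pairs)
    return (num / den) ** (1 / p)

def and_p(pairs, p):
    """p-norm AND: one minus weighted distance from the all-ones point."""
    num = sum((w ** p) * ((1 - s) ** p) for w, s in pairs)
    den = sum(w ** p for w, _ in pairs)
    return 1 - (num / den) ** (1 / p)

# Illustrative document scores for terms B, C, D, with unit term
# weights (b = c = d = 1) and p = 2.
doc = {"B": 0.9, "C": 0.4, "D": 0.7}
p = 2.0
x = or_p([(1.0, doc["C"]), (1.0, doc["D"])], p)   # X = C_c OR_p D_d
rsv = and_p([(1.0, doc["B"]), (1.0, x)], p)       # q_a = B_b AND_p X
print(rsv)
```

As p grows large the operators approach hard Boolean logic, while p = 1 reduces to a vector-style weighted sum, which is what makes the logic "soft" and the collection rankable.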