NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
tool is a stopword list with 595 entries, some of which are morphological variants of the same word. In
addition, we use Porter's stemming algorithm [3] for suffix stripping.
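The two vocabulary-control steps described above can be illustrated as follows. This is a toy sketch: the stopword list and the suffix rules here are small stand-ins, not the 595-entry list or Porter's full algorithm.

```python
# Toy illustration of stopword removal followed by suffix stripping.
# STOPWORDS and the suffix rules are illustrative placeholders only.
STOPWORDS = {"the", "a", "of", "is", "are", "and", "in", "to"}

def strip_suffix(word):
    """Naive suffix stripping; Porter's algorithm has many more rules."""
    for suffix in ("ational", "ization", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, drop stopwords, then strip suffixes."""
    tokens = text.lower().split()
    return [strip_suffix(t) for t in tokens if t not in STOPWORDS]

print(normalize("the retrieval experiments are running"))
```

Note that naive stripping over-conflates (e.g. "running" becomes "runn"); Porter's rule conditions exist precisely to limit such errors.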
There are many other vocabulary control tools, such as spell-checkers, proper-noun and date identification, synonym lists, thesauri, and semantic nets. These could be added in the future.
2.3. Multiple Retrieval Methods
It is well known that different retrieval algorithms applied to the same collection lead to different retrieval results, and that combining them can substantially enhance effectiveness [2,4,5,6]. Our network approach to IR (Section 4) can support multiple retrieval methods fairly easily. Three retrieval methods are used in PIRCS for these experiments, and they align well with the different types of TREC query requirements:
(a) Query-Focused Retrieval
This method addresses the question: given a query q_a, should document d_i be considered relevant to it? This point of view is appropriate for the ad hoc environment, where a set of static documents is processed against a query stream. Each query serves as a focus against which documents are ranked. An approximate probabilistic measure W_ia of this answer is evaluated for each d_i as follows, and is used as the RSV (retrieval status value) to rank the whole collection with respect to q_a:

    W_ia = Sum_k w_ak * d_ik / L_i.     (2)

Here w_ak is the weight of term k in query q_a as given in Eqn. 1, and d_ik/L_i measures the proportion that term k is used in d_i. The sum is over all terms k that overlap between d_i and q_a.
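The ranking of Eqn. 2 can be sketched as below. The query weights and document term counts are illustrative placeholders, not the Eqn. 1 values.

```python
# Minimal sketch of the query-focused RSV of Eqn. 2:
#   W_ia = sum over overlapping terms k of w_ak * d_ik / L_i,
# where w_ak is the query-side term weight, d_ik the count of term k
# in document d_i, and L_i the document length.

def query_focused_rsv(query_weights, doc_counts):
    """Score one document against a query (Eqn. 2 sketch)."""
    L_i = sum(doc_counts.values())  # document length
    return sum(
        w_ak * doc_counts[k] / L_i
        for k, w_ak in query_weights.items()
        if k in doc_counts  # only overlapping terms contribute
    )

# Illustrative data: a query and a small static document collection.
query = {"retrieval": 2.0, "network": 1.5}
docs = {
    "d1": {"retrieval": 3, "network": 1, "text": 4},
    "d2": {"text": 5, "conference": 2},
}
# Rank the whole collection with respect to the query.
ranking = sorted(docs, key=lambda d: query_focused_rsv(query, docs[d]), reverse=True)
print(ranking)
```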
(b) Document-Focused Retrieval
This method addresses the reverse question: given a document d_i, should query q_a be considered relevant to it? This point of view is appropriate for routing, where a set of static queries is processed against a document stream. Each document serves as a focus against which queries can be ranked. However, evaluation of retrieval is usually done with respect to a specific query q_a, and so an approximate probabilistic measure V_ia of this answer is given to each document d_i and used as its RSV:

    V_ia = Sum_k w_ik * q_ak / L_a.     (3)
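Eqn. 3 mirrors Eqn. 2 with the roles of query and document swapped, as the sketch below illustrates. The document-side weights w_ik and query term counts are again illustrative placeholders.

```python
# Sketch of the document-focused RSV of Eqn. 3:
#   V_ia = sum over overlapping terms k of w_ik * q_ak / L_a,
# where w_ik is the document-side term weight, q_ak the count of
# term k in query q_a, and L_a the query length.

def doc_focused_rsv(doc_weights, query_counts):
    """Score one query against a document (Eqn. 3 sketch)."""
    L_a = sum(query_counts.values())  # query length
    return sum(
        w_ik * query_counts[k] / L_a
        for k, w_ik in doc_weights.items()
        if k in query_counts  # only overlapping terms contribute
    )

# Illustrative routing setup: one incoming document, static queries.
doc = {"retrieval": 1.2, "network": 0.8}
queries = {
    "q1": {"retrieval": 2, "experiments": 1},
    "q2": {"conference": 3},
}
for q, counts in queries.items():
    print(q, doc_focused_rsv(doc, counts))
```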
(c) Soft-Boolean Retrieval (Query-Focused)
This method depends on a Boolean query being available for each topic, which we create manually for this experiment. The idea is that topics represented as a list of terms (as in (a) and (b) above) have no structure. Boolean queries have structure, but their logic is `hard' and we lose the ability to attach term importance or to rank the collection. The soft-Boolean approach retains the query structure yet attaches weights to both the terms and the Boolean operators, so that we can soften the logic and provide ranking capability. We have followed the extended Boolean approach of the vector model [7] for this implementation, but with simplified weights. For example, a query q_a may be of the form:

    q_a = B_b AND_p X,  with  X = C_c OR_p D_d.     (4)
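A query of the shape in Eqn. 4 can be evaluated with the p-norm operators of the extended Boolean model [7], as sketched below. We assume here that b, c, d are term weights, that p softens the operators, and that each document supplies per-term scores in [0, 1]; all numeric values are illustrative.

```python
# p-norm soft-Boolean evaluation (extended Boolean model of [7]).
# pairs: list of (term_weight, doc_score) with doc_score in [0, 1].

def or_p(pairs, p):
    """p-norm OR: weighted distance from the all-zeros point."""
    num = sum((w ** p) * (s ** p) for w, s in pairs)
    den = sum(w ** p for w, _ in pairs)
    return (num / den) ** (1 / p)

def and_p(pairs, p):
    """p-norm AND: one minus weighted distance from the all-ones point."""
    num = sum((w ** p) * ((1 - s) ** p) for w, s in pairs)
    den = sum(w ** p for w, _ in pairs)
    return 1 - (num / den) ** (1 / p)

# Illustrative document scores for terms B, C, D, with unit term
# weights (b = c = d = 1) and p = 2.
doc = {"B": 0.9, "C": 0.4, "D": 0.7}
p = 2.0
x = or_p([(1.0, doc["C"]), (1.0, doc["D"])], p)   # X = C_c OR_p D_d
rsv = and_p([(1.0, doc["B"]), (1.0, x)], p)       # q_a = B_b AND_p X
print(rsv)
```

As p grows large the operators approach hard Boolean logic, while p = 1 reduces to a vector-style weighted sum, which is what makes the logic "soft" and the collection rankable.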