SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
Since the SLR methodology is hospitable to the introduction of additional
clue-types, and indeed might be expected to wring a maximum amount of leverage out of
them, the prospect for future, less primitive, SLR systems that combine many types of
evidence seems promising.
Future Possibilities
Though the prototype system described here used only a few simple statistical
clues, the SLR approach is general and in principle flexible enough to accommodate most
of the clue-types that researchers have been interested in as predictors of relevance.
Broadly speaking, retrieval evidence having to do with particular index terms lends itself
to exploitation in the form of variables in the first-stage regression equation, while other
kinds -- properties of the entire query, entire document, or their relationship -- can be
accommodated as variables in the second stage.
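
The division of labor between the two stages can be sketched as follows. This is only
an illustration under stated assumptions: the function names, the clue names
(query_freq, doc_freq, doc_length), every coefficient value, and the use of a plain sum
to form Z are hypothetical and are not taken from the fitted equations of the prototype.

```python
import math

def first_stage_logodds(term_clues, coeffs, intercept):
    """Stage 1: linear log-odds estimate for ONE match stem from its term-level clues."""
    return intercept + sum(coeffs[name] * value for name, value in term_clues.items())

def second_stage_probability(z, pair_clues, z_coeff, coeffs, intercept):
    """Stage 2: combine the pooled term evidence Z with query/document-level clues."""
    logit = intercept + z_coeff * z + sum(coeffs[n] * v for n, v in pair_clues.items())
    return 1.0 / (1.0 + math.exp(-logit))  # logistic transform: log odds -> probability

# Hypothetical coefficients; in practice both sets would be fitted by regression.
FIRST_COEFFS = {"query_freq": 0.4, "doc_freq": 0.3}
SECOND_COEFFS = {"doc_length": -0.001}

# One dictionary of term-level clues per match stem shared by query and document.
terms = [
    {"query_freq": 1.0, "doc_freq": 2.0},
    {"query_freq": 2.0, "doc_freq": 0.5},
]

# Assumption: Z is formed here as a plain sum of the per-stem log odds.
z = sum(first_stage_logodds(t, FIRST_COEFFS, intercept=-1.0) for t in terms)
p = second_stage_probability(
    z, {"doc_length": 300.0}, z_coeff=1.0, coeffs=SECOND_COEFFS, intercept=-2.0
)
```

In this arrangement a new term-level clue is simply one more key in each first-stage
dictionary, and a new query-level or document-level clue is one more key in the
second-stage dictionary; neither function needs to change.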
As an example of possible new evidence at the first stage, suppose by virtue of
parsing, suffix analysis, or dictionary lookup some information is available about the
parts of speech of the match stems in the query and document. Then an additional
categorical variable might be introduced into the first-level regression analysis to represent
the match stem's part of speech in the document, on the hunch that some parts of speech
(e.g. nouns) should be more heavily weighted than others. The general two-level form of
the analysis would remain the same. A further possibility would be to introduce a
variable to represent the event that the part of speech of the stem as it occurs in the query is
the same as its part of speech in the context in which it occurs in the document.
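
One way such part-of-speech evidence might be encoded for a regression is shown below.
It is a sketch only: the category list, the feature names, and the choice of dummy
coding with an implicit baseline category are assumptions made for illustration, and
the part-of-speech tags are presumed to come from some external tagger.

```python
# Assumed category set; any other tag falls into the implicit baseline category.
POS_CATEGORIES = ("noun", "verb", "adjective")

def pos_features(query_pos, doc_pos):
    """Build first-stage regression variables for one match stem.

    query_pos, doc_pos: part-of-speech tags of the stem in the query and in the
    document context (hypothetical inputs supplied by a tagger).
    """
    # Dummy-coded categorical variable: the stem's part of speech in the document.
    features = {
        f"doc_pos_{category}": 1.0 if doc_pos == category else 0.0
        for category in POS_CATEGORIES
    }
    # Indicator for the event that the query and document parts of speech agree.
    features["pos_agrees"] = 1.0 if query_pos == doc_pos else 0.0
    return features

example = pos_features("noun", "verb")
```

The regression would then assign one coefficient to each indicator, so that the relative
weight of nouns, verbs, and so on is estimated from the training data, not fixed in
advance.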
Further clues could be introduced at the second stage. In the present experiment
the only retrieval evidence introduced at the second stage that was not already present in
the first was the document length L, which was intended more as an antidote to a bias in
Z than as an independent predictor of relevance in its own right. But nothing prevents
any helpful relationship between the query and document from being brought to bear. As
an example, suppose a measure of the mutual closeness of the query's match stems in the
document is to be introduced on the hypothesis that the closer together the query stems
tend to occur in the document, the likelier it is (other things being equal) that the
document is relevant (Keen 1992). Such a measure of proximity could be added as a new
variable in the second-level equation, with no other change being needed in the
underlying statistical framework.
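
The text does not fix a particular definition of proximity. As one hypothetical choice,
the sketch below measures closeness by the shortest run of document tokens that contains
every match stem at least once, and scores it as the number of distinct match stems
divided by that window length. The function names and this specific formula are
assumptions for illustration, not the measure proposed by Keen (1992).

```python
from collections import Counter

def min_cover_window(doc_stems, query_stems):
    """Length of the shortest token window containing every match stem, or None."""
    targets = set(query_stems) & set(doc_stems)  # stems shared by query and document
    if not targets:
        return None
    counts = Counter()   # occurrences of each target stem inside the current window
    covered = 0          # number of distinct target stems inside the current window
    best = None
    left = 0
    for right, stem in enumerate(doc_stems):
        if stem in targets:
            counts[stem] += 1
            if counts[stem] == 1:
                covered += 1
        # Shrink from the left while the window still covers every target stem.
        while covered == len(targets):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_stem = doc_stems[left]
            if left_stem in targets:
                counts[left_stem] -= 1
                if counts[left_stem] == 0:
                    covered -= 1
            left += 1
    return best

def proximity(doc_stems, query_stems):
    """Score in (0, 1]: distinct match stems per token of the tightest window; 0.0 if none."""
    window = min_cover_window(doc_stems, query_stems)
    if window is None:
        return 0.0
    targets = set(query_stems) & set(doc_stems)
    return len(targets) / window

score = proximity(["a", "x", "b", "y", "y", "a", "b"], ["a", "b"])
```

A value computed this way would enter the second-level equation as one more variable
alongside Z and L. Note that a document with a single match stem trivially receives the
maximum score under this definition, so in practice the variable would probably need to
be used together with the number of match stems.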
Conclusions
1. The TREC results indicate that the SLR methodology is capable of achieving a
respectable degree of retrieval effectiveness even when the retrieval evidence is
confined to a few simple frequency clues. (`Respectable' in this context means
competitive with the median performance of other systems, most of which use more
elaborate evidence.) Since nothing prevents the incorporation of additional clue
types into future SLR systems, and the regression procedure should help to
combine them with existing clues in an optimal way, the outlook for the retrieval
effectiveness of the SLR approach seems promising.
2. The prototype SLR system demonstrates that a probabilistic initial ranking can be
achieved with a run-time efficiency approximately equivalent to that of a vector