SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression
chapter
W. Cooper
F. Gey
A. Chen
National Institute of Standards and Technology
Donna K. Harman
This split-level approach allows for a natural separation of the available retrieval
clues into two kinds. Statistical inferences based on properties of particular terms may be
drawn first, while other kinds of evidence not confined to particular terms (e.g. document
length, citedness, etc.) are saved for the second stage of statistical inference. An impor-
tant virtue of this split-level approach is that the second-stage regression tends to correct
for biases introduced by the statistical simplifying assumptions used to consolidate the
results of the first stage.
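As a rough illustration of the two stages described above, the following Python sketch pools first-stage per-term logodds and then applies a second-stage linear correction. The pooling rule (an independence-style sum taken relative to a prior), the use of log document length as the second-stage clue, the function names, and every coefficient value are assumptions made for illustration only; they are not taken from the paper's equations.

```python
import math

def combine_term_logodds(term_logodds, prior_logodds):
    """Pool per-term logodds of relevance into one value.

    Assumed simplification: each term contributes its shift away from
    the prior logodds, and the shifts are simply added together.
    """
    return prior_logodds + sum(lo - prior_logodds for lo in term_logodds)

def second_stage(pooled_logodds, doc_length, coef):
    """Second-stage correction on the logodds scale.

    coef = (d0, d1, d2) are hypothetical regression coefficients.
    doc_length must be positive because its logarithm is taken.
    """
    d0, d1, d2 = coef
    return d0 + d1 * pooled_logodds + d2 * math.log(doc_length)

def to_probability(logodds):
    """Convert a logodds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-logodds))

# Illustrative values only, not drawn from any experiment.
pooled = combine_term_logodds([-2.0, -1.0], prior_logodds=-3.0)
corrected = second_stage(pooled, doc_length=1.0, coef=(0.0, 0.5, 0.0))
estimate = to_probability(corrected)
```

In this sketch a second-stage slope `d1` below 1 shrinks the pooled logodds toward zero, which is one way a second regression could compensate for a first stage that overstates the combined evidence.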
A `Bare-Bones' SLR Methodology
Because the sole focus of interest in the experiment was the logic of the SLR
approach, it seemed appropriate to keep all other design complications to a minimum.
Except for the capacity to perform the two-level probabilistic computations requisite for
SLR, therefore, the experimental system was kept as simple and automatic as possible.
Thus no phrase discovery, part-of-speech tagging, disambiguation, or other linguistically
sophisticated operations were incorporated, nor was a thesaurus included for the confla-
tion of synonyms or other purposes, nor was the descriptor vocabulary structured in any
way. There was no clustering, no knowledge base, no set of implicative rules, no net-
work, nor anything else `AI-like.' All indexing was performed extractively without bene-
fit of human intervention. No use was made of the manually assigned descriptors in the
document collections that had them.
The experimental retrieval system was implemented by modifying the SMART
system (Version 10), a suite of IR programs generously provided to the IR research com-
munity by researchers at Cornell University. Since the new model to be implemented
was probabilistic, all features of SMART motivated by the vector space retrieval model
were left unused or replaced by corresponding probabilistic operations. The SMART
stop list was used as-is except for the addition of a few common words from the query
vocabulary thought unlikely to be content-indicative (e.g. `document', `contains'). The
SMART system's stemmer was used without modification to strip suffixes and endings
off of all query and document words that survived the stop list. The search algorithm
used a slightly extended form of SMART's inverted file. The retrieval speed and general
programming complexity of the SMART system were left substantially unaffected by this
conversion to SLR-based retrieval, which meant that the objective of run-time efficiency
and complexity roughly comparable to that of a vector processing system had been
achieved.
The Design Equations
At the heart of the design are four statistical equations. Taken together they are
capable of supplying an estimate, for any query-document pair of interest, of the proba-
bility that the document in question is relevant to the query in question. We shall take
them up in their natural order of application.
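Since the equations are stated in terms of odds and logodds, a small Python sketch of the conversions may be useful. It assumes the standard definitions, odds = p / (1 - p) and logodds = ln(odds); the function names are illustrative and do not come from the paper or from SMART.

```python
import math

def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / (1.0 - p)

def logodds(p):
    """Natural log of the odds of an event with probability p."""
    return math.log(odds(p))

def probability_from_logodds(x):
    """Inverse of logodds: map any real x back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))
```

On this scale a probability of 0.5 corresponds to a logodds of 0, probabilities above 0.5 give positive values, and probabilities below 0.5 give negative values.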
Some preliminary vocabulary: A `logodds' may be regarded simply as a probabil-
ity re-expressed on a special scale. The odds $O(E)$ of an event $E$ is by definition
$P(E)/P(\bar{E})$. The conditional odds $O(E_1 \mid E_2)$ is
$P(E_1 \mid E_2)/P(\bar{E}_1 \mid E_2)$. The `logodds' of $E$, sometimes also