NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
While the Text Structurer module processes documents as described above, the analysis of Topic Statements for their Text Structure requirements is done by the Natural Language Query Constructor (QC), which also analyzes the proper noun and complex nominal requirements of Topic Statements. The QC, as well as the matching and ranking of documents using these sources of linguistic information, is described below.
2. B. Subject Field Coder
The Subject Field Coder (SFC), as reported at TREC-1, has been producing consistently reliable semantic vectors to represent both documents and Topic Statements, using the semantic codes assigned to word senses in a machine-readable dictionary. Details of this process are reported in Liddy et al., 1993. Our more recent efforts on this module have focused on multiple ways to exploit SFC-based similarity values between document and query vectors. One implementation is the use of the ranked vector-similarity values for predicting a cut-off criterion of potentially relevant documents when the module is used as an initial filter. This replaces the earlier practice used in the eighteen-month TIPSTER testing, where documents were ranked by their SFC-vector similarity to a query SFC-vector and the top two thousand documents were passed to the CG Matcher, since CG matching is too computationally expensive to handle all documents in the collection. To report the SFC's performance at that time, we reported how far down the ranked list of documents the system would need to process documents in order to retrieve all the judged relevant documents. Although the results were highly promising (all relevant documents were, on average, in the top 37% of the ranked list based on SFC similarity values), this figure varies considerably across individual Topic Statements. Therefore, we needed to devise a method for predicting a priori, for individual Topic Statements, the cut-off criterion for any desired level of recall. We first developed a method that could successfully predict a cut-off criterion based on SFC similarity values alone. We then extended the algorithm to incorporate the similarity values produced when proper noun, complex nominal, and text structure requirements are considered as well, to produce an integrated ranking based on these varied sources of linguistic information.
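The sketch below is an illustration only, not the DR-LINK implementation itself: it assumes SFC vectors are stored as equal-length lists of category weights, that similarity is measured by the cosine of the two vectors, and that documents arrive as a dictionary keyed by document id; the function names are invented for the example.

    import math

    def cosine(u, v):
        # Cosine similarity between two SFC vectors (lists of category weights).
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank_by_sfc(query_vec, doc_vecs):
        # Rank documents by the similarity of their SFC vector to the query's
        # SFC vector, most similar first.
        scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)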
The SFC-based cut-off criterion uses a multiple regression formula which was developed on the odd-numbered Topic Statements from 1 to 50 and a training corpus of Wall Street Journal articles. The regression formula takes into account the distribution of similarity values for documents in response to a particular query by incorporating the mean and standard deviation of the similarity-value distribution, the similarity of the top-ranked document, and the desired recall level. The cut-off criterion was tested on the twenty-five held-out Topic Statements. The averaged results, when a user is striving for 100% recall, showed that only 39.65% of the 173,255 documents would need to be processed further. This document set, in fact, contained 92% of the judged-relevant documents.
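Since the regression coefficients themselves are not given here, the following sketch simply assumes a generic linear model over the four features named above (mean, standard deviation, top-ranked similarity, desired recall); the coeffs tuple and the helper names are hypothetical.

    import statistics

    def predict_cutoff(similarities, desired_recall, coeffs):
        # coeffs = (b0, b1, b2, b3, b4), fit by multiple regression on the
        # training Topic Statements; the features are the mean and standard
        # deviation of the similarity distribution, the top-ranked
        # similarity, and the desired recall level.
        b0, b1, b2, b3, b4 = coeffs
        mean = statistics.mean(similarities)
        stdev = statistics.pstdev(similarities)
        top = max(similarities)
        return b0 + b1 * mean + b2 * stdev + b3 * top + b4 * desired_recall

    def filter_by_cutoff(ranked, cutoff):
        # ranked is the (doc_id, similarity) list from the earlier sketch.
        return [(doc_id, sim) for doc_id, sim in ranked if sim >= cutoff]

Under this reading, the documents passed on to the later matching stages are simply those pairs from the ranked list whose similarity meets or exceeds the predicted cut-off value.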
The advantage of the cut-off criterion is its sensitivity to the varied distributions of SFC similarity values for individual Topic Statements, which appears to reflect how "appropriate" a Topic Statement is for a particular database. For many queries, a relatively small portion of the database, when ranked by similarity to the Topic Statement, will need to be further processed. For example, for Topic Statement forty-two, when the goal is 100% recall, the regression formula predicts a cut-off criterion similarity value which requires that only 13% of the ranked output be further processed, and the available relevance judgments show that this pool of documents contains 99% of the documents judged relevant for that query.
2. C. V-8 Matching
Given the complete modularity of the first four modules in the system, for the twenty-four month TIPSTER testing we reordered two modules so that Text Structuring is done prior to Subject Field Coding. This allowed us to implement and test a new version of matching which combines in a unique way the Text Structurer and the Subject Field Coder. We refer to this version as the V-8 model, since eight SFC vectors are produced for each document, one for each of the seven meta-categories, plus one for all of the categories combined. The V-8 model, therefore, provides multiple SFC vectors for each document, thereby representing the distribution of SFCs over the various meta-text components that occur in a news-text document. This means, in the V-8 matching, that if certain content areas of the Topic Statement are required to occur in one meta-text component, e.g., CONSEQUENCE, and other content is required to occur in another meta-text component, e.g., FUTURE, this proportional division can be matched against the V-8 vectors produced for each document at a fairly abstract, subject level (a sketch of this component-wise matching appears below). For the TIPSTER twenty-four month evaluation, we have experimented with several formulas for combining the similarity values of