NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
While the Text Structurer module processes documents as described above, the analysis of Topic Statements for their Text Structure requirements is done by the Natural Language Query Constructor (QC), which also analyzes the proper noun and complex nominal requirements of Topic Statements. The QC, as well as the matching and ranking of documents using these sources of linguistic information, is described below.
2. B. Subject Field Coder
The Subject Field Coder (SFC), as reported at TREC-1, has been producing consistently reliable semantic vectors to represent both documents and Topic Statements, using the semantic codes assigned to word senses in a machine-readable dictionary. Details of this process are reported in Liddy et al., 1993. Our more recent efforts on this module have focused on multiple ways to exploit SFC-based similarity values between document and query vectors. One implementation is the use of the ranked vector-similarity values for predicting a cut-off criterion of potentially relevant documents when the module is used as an initial filter. This replaces the earlier practice used in the eighteen-month TIPSTER testing, where documents were ranked by their SFC-vector similarity to a query SFC-vector and the top two thousand documents were passed to the CG Matcher, since CG matching is too computationally expensive to handle all documents in the collection. To report the SFC's performance at that time, we reported how far down the ranked list of documents the system would need to process documents in order to retrieve all the judged relevant documents. Although the results were highly promising (all relevant documents were, on average, in the top 37% of the ranked list based on SFC similarity values), this figure varies considerably across individual Topic Statements. Therefore, we needed to devise a method for predicting a priori, for individual Topic Statements, the cut-off criterion for any desired level of recall. We first developed a method that could successfully predict a cut-off criterion based on SFC similarity values alone. We then extended the algorithm to incorporate the similarity values produced when proper noun, complex nominal, and text structure requirements are considered as well, to produce an integrated ranking based on these varied sources of linguistic information.
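The sketch below is an illustration only, not the DR-LINK implementation itself: it assumes SFC vectors are stored as equal-length lists of category weights, that similarity is measured by the cosine of the two vectors, and that documents arrive as a dictionary keyed by document id; the function names are invented for the example.

    import math

    def cosine(u, v):
        # Cosine similarity between two SFC vectors (lists of category weights).
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank_by_sfc(query_vec, doc_vecs):
        # Rank documents by the similarity of their SFC vector to the query's
        # SFC vector, most similar first.
        scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)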
The SFC-based cut-off criterion uses a multiple regression formula which was developed on the odd-numbered Topic Statements from 1 to 50 and a training corpus of Wall Street Journal articles. The regression formula takes into account the distribution of similarity values for documents in response to a particular query by incorporating the mean and standard deviation of the similarity-value distribution, the similarity of the top-ranked document, and the desired recall level. The cut-off criterion was tested on the twenty-five held-out Topic Statements. The averaged results, when a user is striving for 100% recall, showed that only 39.65% of the 173,255 documents would need to be processed further. This document set, in fact, contained 92% of the judged-relevant documents.
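Since the regression coefficients themselves are not given here, the following sketch simply assumes a generic linear model over the four features named above (mean, standard deviation, top-ranked similarity, desired recall); the coeffs tuple and the helper names are hypothetical.

    import statistics

    def predict_cutoff(similarities, desired_recall, coeffs):
        # coeffs = (b0, b1, b2, b3, b4), fit by multiple regression on the
        # training Topic Statements; the features are the mean and standard
        # deviation of the similarity distribution, the top-ranked
        # similarity, and the desired recall level.
        b0, b1, b2, b3, b4 = coeffs
        mean = statistics.mean(similarities)
        stdev = statistics.pstdev(similarities)
        top = max(similarities)
        return b0 + b1 * mean + b2 * stdev + b3 * top + b4 * desired_recall

    def filter_by_cutoff(ranked, cutoff):
        # ranked is the (doc_id, similarity) list from the earlier sketch.
        return [(doc_id, sim) for doc_id, sim in ranked if sim >= cutoff]

Under this reading, the documents passed on to the later matching stages are simply those pairs from the ranked list whose similarity meets or exceeds the predicted cut-off value.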
The advantage of the cut-off criterion is its sensitivity to the varied distributions of SFC similarity values for individual Topic Statements, which appears to reflect how "appropriate" a Topic Statement is for a particular database. For many queries, a relatively small portion of the database, when ranked by similarity to the Topic Statement, will need to be further processed. For example, for Topic Statement forty-two, when the goal is 100% recall, the regression formula predicts a cut-off criterion similarity value which requires that only 13% of the ranked output be further processed, and the available relevance judgments show that this pool of documents contains 99% of the documents judged relevant for that query.
2. C. V-8 Matching
Given the complete modularity of the first four modules in the system, for the twenty-four month TIPSTER testing we reordered two modules so that Text Structuring is done prior to Subject Field Coding. This allowed us to implement and test a new version of matching which combines in a unique way the Text Structurer and the Subject Field Coder. We refer to this version as the V-8 model, since eight SFC vectors are produced for each document, one for each of the seven meta-categories, plus one for all of the categories combined. The V-8 model, therefore, provides multiple SFC vectors for each document, thereby representing the distribution of SFCs over the various meta-text components that occur in a news-text document. This means, in the V-8 matching, that if certain content areas of the Topic Statement are required to occur in one meta-text component, e.g., CONSEQUENCE, and other content is required to occur in another meta-text component, e.g., FUTURE, this proportional division can be matched against the V-8 vectors produced for each document at a fairly abstract, subject level (a sketch of this component-wise matching appears below). For the TIPSTER twenty-four month evaluation, we have experimented with several formulas for combining the similarity values of