NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

DR-LINK: A System Update for TREC-2
E. Liddy, S. Myaeng
National Institute of Standards and Technology
D. K. Harman

The QC sublanguage grammar relies on function words (e.g. conjunctions, prepositions, relative pronouns), meta-level phrases (e.g. "such as", "examples of", "as well as"), and punctuation (e.g. commas, semi-colons) to recognize and extract the relevancy requirements of Topic Statements. These linguistic features serve as clues to the `organizing' structure of a Topic Statement and present each Topic Statement's unique thematic content in a recognizable frame. The QC sublanguage interprets a Topic Statement into pattern-action rules which are used to reduce each sentence in a Topic Statement into a first order logic assertion, reflecting the Boolean-like requirements of Topic Statements, including NOT'd assertions. In addition, definite noun phrase anaphors are recognized and resolved by sublanguage grammar processing rules.

2.G. Integrated Matcher

Each logical assertion produced by the QC for a Topic Statement is evaluated against the entries in the document inverted file, and a weight is assigned to each segment of text (either a clause or a sentence) which has any similarity. The weighting scheme we are currently using evolved from iterative testing. Each segment of text is indexed in the inverted file with a text structure component label and will be assigned a weight if it contains any proper nouns or complex nominals that match the Topic Statement's requirements.
The following weights are assigned:

    proper noun            = 1.00
    complex nominal        = 1.00
    proper noun category   = 0.50

This means, for example, that if, in response to the following requirement from a Topic Statement:

    A relevant document will provide data on Japanese laws, regulations, and/or practices which help the foreigner understand how Japan controls, or does not control, stock-market practices which could be labeled as insider trading.

a document text-segment contains `Japanese law', and `stock-market practice' (or one of its synonymous phrases), and `insider trading' (or one of its synonymous phrases), that segment is assigned a preliminary value of 3.00. Depending on which field in the Topic Statement the assertion came from, and whether the document text-segment matches the Topic Statement's Text Structure requirement, the preliminary value will be multiplied by one of the following coefficients:

    Topic field and required Text Structure component                        = 1.00
    Desc, Narr, or Concept field and required Text Structure component       = 0.75
    Topic field and non-required Text Structure component                    = 0.50
    Desc, Narr, or Concept field and non-required Text Structure component   = 0.25

So if `Japanese law' and `stock-market practice' and `insider trading' were conceptual requirements from a Topic field assertion that also required them to occur in an EVALUATION or LEAD-MAIN text component, and they occurred in a document text segment which has been tagged by the Text Structurer as EVALUATION, the value of 3 would be multiplied by 1; whereas if that assertion came from the Description field in the Topic Statement and the three required phrases occurred in a document text segment labelled CONSEQUENCE by the Text Structurer, the value of 3 would be multiplied by .25.

Since the QC interprets each sentence in the Topic, Description, Narrative, and Concept fields in a Topic Statement, multiple, sometimes overlapping, sometimes repetitive assertions are produced for a single Topic Statement.
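The segment weighting described above can be illustrated with a minimal sketch. It is not the DR-LINK implementation: the names `MATCH_WEIGHTS`, `coefficient`, and `segment_value`, the string labels for match types and fields, and the choice to treat all three example phrases as complex nominals are assumptions made only for illustration. The numeric weights and coefficients are the ones listed in the text.

```python
# Illustrative sketch only; all names and labels here are hypothetical.

# Weights per matched item, as listed in the text.
MATCH_WEIGHTS = {
    "proper_noun": 1.00,
    "complex_nominal": 1.00,
    "proper_noun_category": 0.50,
}

TOPIC_STATEMENT_FIELDS = {"Topic", "Desc", "Narr", "Concept"}


def coefficient(field: str, in_required_component: bool) -> float:
    """Coefficient chosen by source field and Text Structure match."""
    if field not in TOPIC_STATEMENT_FIELDS:
        raise ValueError(f"unknown Topic Statement field: {field}")
    is_topic_field = field == "Topic"
    if in_required_component:
        return 1.00 if is_topic_field else 0.75
    return 0.50 if is_topic_field else 0.25


def segment_value(match_types, field: str, in_required_component: bool) -> float:
    """Preliminary value (sum of match weights) times the coefficient."""
    preliminary = sum(MATCH_WEIGHTS[m] for m in match_types)
    return preliminary * coefficient(field, in_required_component)


# The three matched phrases from the example, assumed here to be
# complex nominals (either way each carries a weight of 1.00).
matches = ["complex_nominal", "complex_nominal", "complex_nominal"]

# Topic field assertion, segment tagged with a required component.
topic_required = segment_value(matches, "Topic", True)

# Description field assertion, segment tagged with a non-required component.
desc_non_required = segment_value(matches, "Desc", False)
```

Under these assumptions, `topic_required` reproduces the first case in the text (3 multiplied by 1) and `desc_non_required` the second (3 multiplied by .25).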
In the current implementation, each of these Topic Statement assertions is compared to the inverted document file, and the highest similarity value for a single assertion in the document is used as that document's integrated similarity value for that Topic Statement. The similarity value which results from the QC module matching is combined with the SFC similarity value of the document, and an integrated similarity score for each document is produced. This similarity value can be used in several ways. Firstly, the two similarity values can be used to provide a full ranking of all the documents which