SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
DR-LINK: A System Update for TREC-2
chapter
E. Liddy
S. Myaeng
National Institute of Standards and Technology
D. K. Harman
The QC sublanguage grammar relies on function words (e.g. conjunctions, prepositions, relative pronouns), meta-level
phrases (e.g. "such as", "examples of", "as well as"), and punctuation (e.g. commas, semi-colons) to recognize
and extract the relevancy requirements of Topic Statements. These linguistic features serve as clues to the
'organizing' structure of a Topic Statement and present each Topic Statement's unique thematic content in a
recognizable frame. The QC sublanguage interprets a Topic Statement into pattern-action rules which are used to
reduce each sentence in a Topic Statement into a first order logic assertion, reflecting the boolean-like requirements
of Topic Statements, including NOT'd assertions. In addition, definite noun phrase anaphors are recognized and
resolved by sublanguage grammar processing rules.
2.G. Integrated Matcher
Each logical assertion produced by the QC for a Topic Statement is evaluated against the entries in the document
inverted file and a weight is assigned to each segment of text (either a clause or a sentence) which has any similarity.
The weighting scheme we are currently using evolved from iterative testing. Each segment of text is indexed in the
inverted file with a text structure component label and will be assigned a weight if it contains any proper nouns or
complex nominals that match the Topic Statement's requirements. The following weights are assigned:
    proper noun          = 1.00
    complex nominal      = 1.00
    proper noun category = 0.50
This means, for example, that if, in response to the following requirement from a Topic Statement:
A relevant document will provide data on Japanese laws, regulations, and/or practices
which help the foreigner understand how Japan controls, or does not control, stock-
market practices which could be labeled as insider trading.
a document text-segment contains 'Japanese law', and 'stock-market practice' (or one of its synonymous phrases), and
'insider trading' (or one of its synonymous phrases), that segment is assigned a preliminary value of 3.00. Depending
on which field in the Topic Statement the assertion came from, and whether the document text-segment matches the
Topic Statement's Text Structure requirement, the preliminary value will be multiplied by one of the following
coefficients:
    Topic field and required Text Structure component                       = 1.00
    Desc, Narr, or Concept field and required Text Structure component      = 0.75
    Topic field and non-required Text Structure component                   = 0.50
    Desc, Narr, or Concept field and non-required Text Structure component  = 0.25
So if 'Japanese law' and 'stock-market practice' and 'insider trading' were conceptual requirements from a Topic field
assertion that also required them to occur in an EVALUATION or LEAD-MAIN text component, and they occurred
in a document text segment which has been tagged by the Text Structurer as EVALUATION, the value of 3 would
be multiplied by 1; whereas if that assertion came from the Description field in the Topic Statement and the three
required phrases occurred in a document text segment labelled CONSEQUENCE by the Text Structurer, the value of
3 would be multiplied by 0.25.
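
The arithmetic of this example can be sketched in Python. This is only an illustration under stated assumptions, not
the DR-LINK implementation: the names MATCH_WEIGHTS, COEFFICIENTS, and segment_score are hypothetical, the
preliminary value is assumed to be the plain sum of the weights of the matched terms (as the 3.00 example suggests),
and the three matched phrases are treated as three weight-1.00 matches.

```python
# Weight per type of matched term, as listed in the text.
MATCH_WEIGHTS = {
    "proper_noun": 1.00,
    "complex_nominal": 1.00,
    "proper_noun_category": 0.50,
}

# Coefficient keyed by (assertion is from the Topic field,
#                       segment carries a required Text Structure component).
COEFFICIENTS = {
    (True, True): 1.00,
    (False, True): 0.75,
    (True, False): 0.50,
    (False, False): 0.25,
}


def segment_score(match_types, field, segment_component, required_components):
    """Return the weighted value of one document text segment for one assertion.

    match_types: list of matched term types found in the segment (assumed input).
    field: Topic Statement field the assertion came from, e.g. "Topic" or "Desc".
    segment_component: Text Structure label of the segment, e.g. "EVALUATION".
    required_components: set of labels the assertion requires.
    """
    # Assumption: the preliminary value is the sum of the matched-term weights.
    preliminary = sum(MATCH_WEIGHTS[m] for m in match_types)
    from_topic_field = field == "Topic"
    in_required = segment_component in required_components
    return preliminary * COEFFICIENTS[(from_topic_field, in_required)]


matches = ["complex_nominal"] * 3          # three weight-1.00 matches
required = {"EVALUATION", "LEAD-MAIN"}
print(segment_score(matches, "Topic", "EVALUATION", required))   # 3.0
print(segment_score(matches, "Desc", "CONSEQUENCE", required))   # 0.75
```

The two printed values correspond to the two cases described above: 3 multiplied by 1, and 3 multiplied by 0.25.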
Since the QC interprets each sentence in the Topic, Description, Narrative, and Concept fields in a Topic Statement,
multiple, sometimes overlapping, sometimes repetitive assertions are produced for a single Topic Statement. In the
current implementation, each of these Topic Statement assertions is compared to the inverted document file, and the
highest similarity value for a single assertion in the document is used as that document's integrated similarity value
for that Topic Statement.
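
The selection of a document's value from its assertions can be sketched as a maximum over assertions. The function
name document_similarity and the input layout are hypothetical, and the sketch assumes that the value of an assertion
in a document is the best weighted value of any matching text segment, a detail the text does not spell out.

```python
def document_similarity(assertion_segment_scores):
    """Return the highest value reached by any single assertion in a document.

    assertion_segment_scores: dict mapping an assertion identifier to the list
    of weighted segment values that assertion obtained in this document
    (assumed layout; an empty list means the assertion matched no segment).
    """
    best = 0.0  # Assumption: a document with no matching segment scores 0.
    for scores in assertion_segment_scores.values():
        if scores:
            best = max(best, max(scores))
    return best


# Three assertions from one Topic Statement, scored against one document.
scores = {"topic_1": [3.0, 0.75], "desc_1": [1.5], "narr_1": []}
print(document_similarity(scores))  # 3.0
```

Taking the maximum rather than a sum means that overlapping or repeated assertions do not inflate a document's value.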
The similarity value which results from the QC module matching is combined with the SFC similarity value of the
document, and an integrated similarity score for each document is produced. This similarity value can be used in
several ways. Firstly, the two similarity values can be used to provide a full ranking of all the documents which