SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
TREC-2 Routing and Ad-Hoc Retrieval Evaluation using the INQUERY System
chapter
W. Croft
J. Callan
J. Broglio
National Institute of Standards and Technology
D. K. Harman
3.1.2 Constraint capture
All text in the query is searched for constraint expressions. These include the word
company, the phrase not U.S., and a restriction in the nationality section of the <fac> field
to U.S. or another nationality. A restriction to U.S. nationality as the area of interest is
implemented by penalizing documents for references to foreign countries. A restriction to
other nationalities is implemented by repeating that country as a term. This asymmetry
depends on the fact that the document collection is drawn solely from U.S. sources, and
therefore the U.S., as the default area of interest, is rarely referred to unless the government
or foreign policy implementation is under discussion.
There is some recognition of simple time expressions, such as since 1984, which are
expanded to the set of years that the phrase in question might intend.
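The expansion of a phrase like since 1984 can be sketched as follows. This is a minimal illustration, not the INQUERY implementation: the function name expand_since and the last_year parameter are assumptions, since the text does not say where the expanded range ends.

```python
import re

def expand_since(text, last_year):
    """Replace 'since YYYY' with the explicit years YYYY..last_year.

    last_year is a hypothetical upper bound (for example, the most recent
    year covered by the collection); the source does not specify one.
    """
    def repl(match):
        start = int(match.group(1))
        if start > last_year:
            # Leave the expression untouched if the range would be empty.
            return match.group(0)
        return " ".join(str(year) for year in range(start, last_year + 1))

    return re.sub(r"\bsince\s+(\d{4})\b", repl, text, flags=re.IGNORECASE)
```

With last_year set to 1987, the text mergers since 1984 would become mergers 1984 1985 1986 1987, so each year can match as an ordinary query term.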
Countries are recognized as such and are handled so that expressions like South Africa
are phrased as #1 ( south africa ) even when they appear in the middle of a larger group
of capitalized words. In addition, proper names such as country names are moved out of
the scope of #PHRASE operators, since reducing the number of words in a #PHRASE
generally increases its effectiveness. Nationality constraints are better maintained within
the scope of the larger and more tolerant #SUM operator. For example, the phrase
``import ban on South African diamonds''
becomes, by stages,
#PHRASE(import ban on #SYN(#1(south african) #1(south africa)) diamonds)
and finally
#SUM(#SYN(#1(south african) #1(south africa))
#PHRASE(import ban on diamonds)).
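The two stages above can be collapsed into one pass for illustration. The sketch below is an assumption-laden approximation: the COUNTRY_VARIANTS table and the hoist_countries function are hypothetical, and the real system presumably uses a much larger country lexicon and a part-of-speech tagger to find phrase boundaries.

```python
# Hypothetical lexicon: surface form -> variants grouped under #SYN.
COUNTRY_VARIANTS = {
    "south african": ["south african", "south africa"],
    "south africa": ["south african", "south africa"],
}

def hoist_countries(phrase):
    """Move country names out of a #PHRASE and into an enclosing #SUM."""
    words = phrase.lower().split()
    kept, hoisted = [], []
    i = 0
    while i < len(words):
        matched = False
        # Try the longer (two-word) match first, then a single word.
        for n in (2, 1):
            chunk = words[i:i + n]
            candidate = " ".join(chunk)
            if len(chunk) == n and candidate in COUNTRY_VARIANTS:
                variants = " ".join(
                    f"#1({v})" for v in COUNTRY_VARIANTS[candidate]
                )
                hoisted.append(f"#SYN({variants})")
                i += n
                matched = True
                break
        if not matched:
            kept.append(words[i])
            i += 1
    parts = list(hoisted)
    if kept:
        parts.append(f"#PHRASE({' '.join(kept)})")
    if not hoisted:
        # No country found: the phrase is left as a plain #PHRASE.
        return parts[0] if parts else ""
    return f"#SUM({' '.join(parts)})"
```

Called on import ban on South African diamonds, this sketch yields the same final form as the example in the text, with the shortened #PHRASE and the #SYN group as siblings under #SUM.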
3.2 Key concept query processing
Key concept query processing is different from prose query processing since the concept
separation provided by the user can presumably be trusted. Instead of using a part-of-
speech tagger, we rely on comma delimitation of concepts, and #PHRASE the words found
between each pair of delimiters.
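A minimal sketch of the comma-delimited treatment follows. The function name key_concepts_to_phrases is hypothetical, and the lowercasing and the wrapping of single-word concepts in #PHRASE are assumptions made for simplicity; the text does not say how one-word concepts are handled.

```python
def key_concepts_to_phrases(con_field):
    """Split a <con> field on commas and wrap each concept in #PHRASE."""
    concepts = [concept.strip().lower() for concept in con_field.split(",")]
    # Skip empty items left by trailing or doubled commas.
    return [f"#PHRASE({concept})" for concept in concepts if concept]
```

For a field such as economic sanctions, import ban this would give #PHRASE(economic sanctions) and #PHRASE(import ban), with no tagging step involved.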
Additionally, if any constraints were found anywhere else in the query, e.g., a mention of
the word company or an exclusionary geographical constraint (e.g., not USA or only USA),
the query will be modified according to these constraints. For example,
only USA → #NOT(#FOREIGNCOUNTRY)
and
not USA → #NOT(#USA).
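The two geographic rules can be pictured as a small pattern table. This is only a sketch: the regular expressions, the GEO_RULES name, and the geo_constraints function are assumptions, and the actual system may recognize many more spellings and constraint forms.

```python
import re

# Hypothetical pattern table: constraint expression -> operator to add.
GEO_RULES = [
    (re.compile(r"\bonly\s+(?:USA|U\.\s?S\.)", re.IGNORECASE),
     "#NOT(#FOREIGNCOUNTRY)"),
    (re.compile(r"\bnot\s+(?:USA|U\.\s?S\.)", re.IGNORECASE),
     "#NOT(#USA)"),
]

def geo_constraints(query_text):
    """Return the operators implied by geographic constraints in the query."""
    return [op for pattern, op in GEO_RULES if pattern.search(query_text)]
```

The returned operators would then be combined with the rest of the query; how they are combined is not described in this section, so it is left out of the sketch.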
If the word company is found in a query, then a second copy of the key concepts (the
<con> field) is produced in which each item in the field appears in an unordered window
operator with the special concept #COMPANY. For example, if the phrase South Africa
appears as a key concept (and company appears somewhere in the query), then the
preprocessor would produce the term #UW50(#COMPANY #1(south africa)), which would
match any document that had a company name within fifty words of South Africa.
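The construction of the second copy can be sketched as below. The function name company_windows and its arguments are hypothetical; the sketch assumes the key concepts have already been converted to query terms such as #1(south africa), and it uses the window size of fifty from the example.

```python
import re

def company_windows(concept_terms, query_text, window=50):
    """Pair each key-concept term with #COMPANY in an unordered window.

    Returns an empty list when the word 'company' does not occur in the
    query, since the second copy is only produced in that case.
    """
    if not re.search(r"\bcompany\b", query_text, flags=re.IGNORECASE):
        return []
    return [f"#UW{window}(#COMPANY {term})" for term in concept_terms]
```

Given the term #1(south africa) and a query that mentions company, this produces #UW50(#COMPANY #1(south africa)), matching the example in the text.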