NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

TREC-2 Routing and Ad-Hoc Retrieval Evaluation using the INQUERY System
W. Croft, J. Callan, J. Broglio
National Institute of Standards and Technology, D. K. Harman

3.1.2 Constraint capture

All text in the query is searched for constraint expressions. Among these expressions are the word company, the phrase not U.S., and a restriction in the nationality section of the <fac> field to U.S. or another nationality. A restriction to U.S. nationality as the area of interest is implemented by penalizing documents for references to foreign countries. A restriction to another nationality is implemented by repeating that country as a term. This asymmetry depends on the fact that the document collection is drawn solely from U.S. sources; the U.S., as the default area of interest, is therefore rarely referred to unless the government or foreign policy implementation is under discussion.

There is some recognition of simple time expressions, such as since 1984, which are expanded to the set of years that might be intended by the phrase in question.

Countries are recognized as such and are handled so that expressions like South Africa are phrased as #1(south africa) even when they appear in the middle of a larger group of capitalized words. In addition, proper names such as country names are moved out of the scope of #PHRASE operators, since reducing the number of words in a #PHRASE generally increases its effectiveness. Nationality constraints can better be maintained within the scope of the larger and more tolerant #SUM operator. For example, the phrase "import ban on South African diamonds" becomes, by stages, #PHRASE(import ban on #SYN(#1(south african) #1(south africa)) diamonds) and finally #SUM(#SYN(#1(south african) #1(south africa)) #PHRASE(import ban on diamonds)).
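The two rewrites described in this section, expansion of a time expression and removal of a country name from a #PHRASE, can be illustrated with a minimal Python sketch. This is not the INQUERY preprocessor itself. The names COUNTRY_FORMS, hoist_countries, and expand_since are hypothetical, the one-entry country table stands in for whatever list the system actually used, and the upper bound latest_year is an assumption, since the text does not say where the expanded set of years ends.

```python
import re

# Hypothetical lookup table: country name -> adjectival form. The country
# list used by the real preprocessor is not described in the text.
COUNTRY_FORMS = {"south africa": "south african"}


def hoist_countries(text):
    """Move country references out of a #PHRASE into an enclosing #SUM."""
    remaining = text.lower()
    hoisted = []
    for noun, adjective in COUNTRY_FORMS.items():
        # The \b anchors keep "south africa" from matching inside
        # "south african".
        pattern = rf"\b(?:{re.escape(adjective)}|{re.escape(noun)})\b"
        if re.search(pattern, remaining):
            remaining = re.sub(pattern, " ", remaining)
            # Both surface forms are treated as synonyms, each as an
            # exact-adjacency #1 term.
            hoisted.append(f"#SYN(#1({adjective}) #1({noun}))")
    phrase = f"#PHRASE({' '.join(remaining.split())})"
    if not hoisted:
        return phrase
    return f"#SUM({' '.join(hoisted)} {phrase})"


def expand_since(text, latest_year):
    """Expand 'since YYYY' to the years YYYY..latest_year (assumed bound)."""
    match = re.search(r"\bsince (\d{4})\b", text.lower())
    if not match:
        return []
    return [str(year) for year in range(int(match.group(1)), latest_year + 1)]


print(hoist_countries("import ban on South African diamonds"))
# -> #SUM(#SYN(#1(south african) #1(south africa)) #PHRASE(import ban on diamonds))
```

The sketch goes directly to the final form and skips the intermediate stage in which the #SYN still sits inside the #PHRASE.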
3.2 Key concept query processing

Key concept query processing is different from prose query processing, since the concept separation provided by the user can presumably be trusted. Instead of using a part-of-speech tagger, we rely on comma delimitation of concepts and apply #PHRASE to the words found between each pair of delimiters. Additionally, if any constraints were found anywhere else in the query, e.g., a mention of the word company or an exclusionary geographical constraint (e.g., not USA or only USA), the query is modified according to these constraints. For example, only USA becomes #NOT(#FOREIGNCOUNTRY) and not USA becomes #NOT(#USA).

If the word company is found in a query, then a second copy of the key concepts (the <con> field) is produced in which each item in the field appears in an unordered window operator with the special concept #COMPANY. For example, if the phrase South Africa appears as a key concept (and company appears somewhere in the query), then the preprocessor would produce the term #UW50(#COMPANY #1(south africa)), which would match any document that had a company name within fifty words of South Africa.
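The key concept steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual preprocessor. The function names, the one-entry COUNTRIES set, and the simple substring tests for "only usa" and "not usa" are all hypothetical. How the resulting terms are combined into a full query is not specified in this section, so the sketch returns them as a flat list.

```python
import re

# Hypothetical country list; country concepts are rendered as #1 terms,
# following the South Africa example in the text.
COUNTRIES = {"south africa"}


def concept_term(concept):
    """Render one comma-delimited concept as a query term."""
    words = " ".join(concept.lower().split())
    if words in COUNTRIES:
        return f"#1({words})"
    return f"#PHRASE({words})"


def process_key_concepts(con_field, query_text):
    """Build query terms from the <con> field plus query-wide constraints."""
    # Trust the user's comma delimitation of concepts.
    concepts = [part.strip() for part in con_field.split(",") if part.strip()]
    terms = [concept_term(c) for c in concepts]
    lowered = query_text.lower()
    # A mention of "company" anywhere adds a second copy of the concepts,
    # each in a 50-word unordered window with #COMPANY.
    if re.search(r"\bcompany\b", lowered):
        terms += [f"#UW50(#COMPANY {concept_term(c)})" for c in concepts]
    # Geographical constraints (simplified detection, an assumption).
    if "only usa" in lowered:
        terms.append("#NOT(#FOREIGNCOUNTRY)")
    elif "not usa" in lowered:
        terms.append("#NOT(#USA)")
    return terms


for term in process_key_concepts("South Africa, diamond imports",
                                 "Which company is affected? Only USA."):
    print(term)
```

Rendering a non-country concept inside the window as #PHRASE(...) is a guess. The text only shows the country case, #UW50(#COMPANY #1(south africa)).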