NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
WORDIJ: A Word Pair Approach to Information Retrieval
J. Danowski
National Institute of Standards and Technology
Donna K. Harman

Table 2: Query style & performance correlations

                       Diff.     Failure
Sentences             -.1053     -.1102
Words                 -.1139     -.1616
Words/sent.            .0736     -.0004
Long words            -.1574     -.1086
Personal words         .0846      .0046
Action words           .1192      .2690
Syllables/word        -.1055      .0763
Reading grade level   -.0256      .0629

Additional failure analysis was conducted to explore whether particular words were associated with performance. The frequency of every word in each query (no stop-word list was applied) was correlated with both performance criteria: 1) continuous difference from the median and 2) failure, indicated by results significantly below median performance. Table 3 presents the correlations that were significant at the .01 alpha level or better across the 98 topics and whose words occurred in at least five different topics. The words `to' and `some' increased in frequency as performance increased, while frequency of the following words was associated with lower performance: `who, more, type, following, been, two.' For the failure criterion, `who, more, been, two' were also significantly associated with lower performance, as were `national, system, support.'

Proper Name Identification. This analysis of words from queries associated with performance suggests that the pair matching approach worked best when the documents used a domain-specific vocabulary. At the other extreme, topics that used more domain-general words had lower performance. In particular, queries that asked for a category of documents, as indicated by words such as `who' and `type,' were more likely to fall in the failure category. The words `system, national, following, been, two' were also associated with higher failure rates.
This suggests that proper noun compounds may require special treatment. The names of organizations, products, locations, etc. apparently cannot be easily identified through direct pair matching when these specific proper nouns are not contained in the query. When a query calls for such specific results, special procedures are probably desirable for identifying proper nouns in documents that match on other query pairs.

Table 3: Query words & performance correlations

WORD          r           No. of Topics
Difference
to            .2743*      15
some          .2480*      10
who           .3570**      8
more          .2509*       8
type          .3740**      6
following     .3069*       6
been          .2580*       6
two           .3750**      5
Failure
national      .2479*      11
system        .2479*       9
who           .3828**      8
more          .2426*       8
been          .2545*       6
two           .4100**      5
support       .2479*       5

* p < .011   ** p < .001

Domain Specificity of Words. An additional implication is that query expansion may be fruitful when dealing with domain-transcendent words. Through use of thesauri or databases such as WordNet, alternative word senses may be disambiguated. Synonyms specific to the proper domain could then be added to the query pairs contained in the original raw query text.

Interestingly, queries that contained the word `some' resulted in higher performance. This may suggest that the criteria for relevance were less stringent for such queries, in that they asked not for an exhaustive and complete fit of query to documents, but for a more partial overlap. The word `to' in queries was also associated with higher performance. This may be associated with the specificity of this word in discourse, indicating relationships of direction, degree, state, contact, possession, etc.

Natural Language Processing on Queries. Together, such query-focused results suggest that future work may benefit from performing complex natural language processing, such as parsing and sense disambiguation, on the queries themselves to tune them before matching.
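As a rough illustration of the query expansion idea raised above, the following Python sketch forms word pairs from a query and adds variants in which a word is replaced by a domain-specific synonym. Everything named here is an assumption made for illustration: `DOMAIN_SYNONYMS` stands in for a thesaurus or WordNet lookup performed after sense disambiguation, the `window` parameter and the definition of a pair as two words within that many positions of each other are not taken from the chapter, and `word_pairs` and `expand_pairs` are hypothetical helper names.

```python
from itertools import product

# Hypothetical domain thesaurus (word -> domain-specific synonyms).
# Assumption: in practice these would come from a thesaurus or WordNet
# after the intended sense of each query word has been chosen.
DOMAIN_SYNONYMS = {
    "system": ["network", "platform"],
    "support": ["funding"],
}


def word_pairs(text, window=3):
    """Return ordered pairs of words lying within `window` positions of each other.

    Assumption: whitespace tokenization and a forward-looking window; the
    chapter's actual pairing rule may differ.
    """
    words = text.lower().split()
    pairs = []
    for i, first in enumerate(words):
        for second in words[i + 1 : i + 1 + window]:
            pairs.append((first, second))
    return pairs


def expand_pairs(pairs, synonyms):
    """Keep every original pair and add pairs with synonyms swapped in."""
    expanded = []
    seen = set()
    for first, second in pairs:
        firsts = [first] + synonyms.get(first, [])
        seconds = [second] + synonyms.get(second, [])
        for pair in product(firsts, seconds):
            if pair not in seen:  # avoid emitting the same pair twice
                seen.add(pair)
                expanded.append(pair)
    return expanded


if __name__ == "__main__":
    original = word_pairs("national system support", window=1)
    for pair in expand_pairs(original, DOMAIN_SYNONYMS):
        print(pair)
```

Keeping the original pairs and only adding synonym variants means that expansion can broaden a match but never discards what the raw query text already specified.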
Sophisticated treatment of queries may improve performance to the point that such treatment of the raw texts themselves, which is expensive, may not add much marginal performance improvement.
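The word-level failure analysis summarized in Table 3 could be approximated along the following lines. This is only a sketch under stated assumptions: the chapter does not say which correlation coefficient or significance test was used, so the code assumes a Pearson correlation with a t-test, and the default `t_crit` of 2.63 is roughly the two-tailed .01 critical value for 96 degrees of freedom (98 topics). Whitespace tokenization and the function names `pearson_r` and `word_performance_correlations` are likewise illustrative choices, not the author's procedure.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson product-moment correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def word_performance_correlations(queries, scores, min_topics=5, t_crit=2.63):
    """Correlate each word's per-query frequency with a performance score.

    queries: one query string per topic (no stop words are removed).
    scores:  one value per topic, e.g. difference from the median,
             or 1/0 for the failure criterion.
    Returns {word: (r, number_of_topics_containing_word)} for words that
    occur in at least `min_topics` topics and whose |t| reaches `t_crit`.
    """
    counts = [Counter(q.lower().split()) for q in queries]
    n = len(queries)
    vocabulary = set().union(*counts)
    results = {}
    for word in vocabulary:
        freqs = [c[word] for c in counts]
        n_topics = sum(1 for f in freqs if f > 0)
        if n_topics < min_topics:
            continue
        r = pearson_r(freqs, scores)
        remainder = 1 - r * r
        if remainder <= 0:  # perfect correlation: t is unbounded
            t = math.copysign(math.inf, r)
        else:
            t = r * math.sqrt((n - 2) / remainder)
        if abs(t) >= t_crit:
            results[word] = (round(r, 4), n_topics)
    return results
```

Run once with the continuous difference scores and once with a 0/1 failure indicator, a routine of this kind would yield the two halves of a table shaped like Table 3.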