SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
WORDIJ: A Word Pair Approach to Information Retrieval
chapter
J. Danowski
National Institute of Standards and Technology
Donna K. Harman
Table 2: Query style & performance correlations

                        Diff.     Failure
Sentences              -.1053     -.1102
Words                  -.1139     -.1616
Words/sent.             .0736     -.0004
Long words             -.1574     -.1086
Personal words          .0846      .0046
Action words            .1192      .2690
Syllables/word         -.1055      .0763
Reading grade level    -.0256      .0629
Additional failure analysis was conducted to explore
whether there were particular words associated with
performance. The frequencies of all words (no stop words)
for each query were correlated with both types of
performance criteria: 1) continuous difference from the
median and 2) failure, indicated by results significantly
below median performance. Table 3 presents the
correlations that were significant at the .01 alpha level or
better across the 98 topics, and which occurred in at least
five different topics.
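The word-level analysis described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function names, the input format (one token list per topic), and the handling of zero variance are assumptions, and the significance test at the .01 level is left out because it needs a t-distribution routine that the standard library does not provide.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # r is undefined here; treated as no association (assumption)
    return cov / math.sqrt(var_x * var_y)


def word_performance_correlations(queries, scores, stop_words=frozenset(),
                                  min_topics=5):
    """Correlate each word's per-topic frequency with a per-topic score.

    queries: list of token lists, one per topic (hypothetical input format)
    scores:  one number per topic, e.g. the difference from the median,
             or 1/0 for the failure criterion
    Returns {word: (r, number_of_topics_containing_word)}.
    """
    counts = [Counter(t for t in q if t not in stop_words) for q in queries]
    vocab = set().union(*counts)
    results = {}
    for word in vocab:
        freqs = [c[word] for c in counts]
        n_topics = sum(1 for f in freqs if f > 0)
        # Keep only words that occur in enough different topics.
        if n_topics >= min_topics:
            results[word] = (pearson_r(freqs, scores), n_topics)
    return results
```

Passing the continuous difference scores gives the first criterion; passing a 1/0 failure indicator gives the second. The default `min_topics=5` mirrors the five-topic cutoff stated in the text.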
The frequencies of `to' and `some' increased as
performance increased, while the frequencies of the following
words were associated with lower performance: `who, more,
type, following, been, two.' For the failure criterion, `who,
more, been, two' were again significantly associated with
lower performance, as were `national, system, support.'
This analysis of words from queries associated with
performance suggests that the pair matching approach
worked best when the documents used a domain-specific
vocabulary.
Proper Name Identification.
At the other extreme, topics that used more domain-general
words had lower performance. In particular, queries
that asked for a category of documents, as indicated by
words such as `who' and `type', were more likely to fall in
the failure category. The words `system, national, following,
been, two' were also associated with higher failure rates.
This suggests that proper noun compounds may require
special treatment. The names of organizations, products,
locations, etc. apparently cannot be easily identified through
direct pair matching when these specific proper nouns are
not contained in the query. When such specific results are
called for by a query, special procedures are probably
desirable for identification of proper nouns in documents that
match on other query pairs.
Domain Specificity of Words.
Table 3: Query words & performance correlations

WORD          r          No. of Topics
Difference
to            .2743*     15
some          .2480*     10
who           .3570**     8
more          .2509*      8
type          .3740**     6
following     .3069*      6
been          .2580*      6
two           .3750**     5
Failure
national      .2479*     11
system        .2479*      9
who           .3828**     8
more          .2426*      8
been          .2545*      6
two           .4100**     5
support       .2479*      5
* p < .01   ** p < .001
An additional implication is that query expansion may
be fruitful when dealing with domain-transcendent words.
Through use of thesauri or databases such as WordNet,
alternative word meaning senses may be disambiguated.
Then synonyms specific to the proper domain could be
added to the actual query pairs contained in the original raw
query text.
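The expansion idea in the preceding paragraph can be illustrated with a small sketch. Everything here is assumed for illustration: the window-based definition of a word pair, the function names, and the `domain_synonyms` dictionary, which stands in for the output of a thesaurus or WordNet lookup after sense disambiguation. No such lookup is performed in the code.

```python
def word_pairs(tokens, window=3):
    """Ordered pairs (w1, w2) where w2 follows w1 within `window` positions.

    The window-based pair definition is an assumption made for this sketch.
    """
    pairs = set()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 1 + window]:
            pairs.add((first, second))
    return pairs


def expand_pairs(pairs, domain_synonyms):
    """Add pairs in which either word is replaced by a domain synonym.

    domain_synonyms: hypothetical mapping from a domain-general word to
    synonyms already judged specific to the query's domain.
    The original pairs are always retained.
    """
    expanded = set(pairs)
    for first, second in pairs:
        firsts = [first] + list(domain_synonyms.get(first, []))
        seconds = [second] + list(domain_synonyms.get(second, []))
        for a in firsts:
            for b in seconds:
                expanded.add((a, b))
    return expanded


# Usage with made-up data: expand the pairs of a three-word query fragment.
query_pairs = word_pairs(["national", "support", "system"], window=1)
expanded_pairs = expand_pairs(query_pairs, {"system": ["network"]})
```

Because the original pairs are kept, expansion in this form can only add candidate matches; whether the added pairs help or hurt precision would have to be tested.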
Interestingly, queries that contained the word `some'
resulted in higher performance. This may suggest that the
criteria for relevance were less stringent for such queries, in
that they asked not for an exhaustive and complete fit of
query to documents, but a more partial overlap. The word
`to' in queries was also associated with higher performance.
This may be associated with the specificity of this word in
discourse, indicating relationships of direction, degree, state,
contact, possession, etc.
Natural Language Processing on Queries.
Together, such query-focused results suggest that future
work may benefit from performing complex natural language
processing such as parsing, sense disambiguation, etc. on the
queries themselves to tune them before matching.
Sophisticated treatment of queries may improve performance
to the point that such treatment of the raw texts themselves,
which is expensive, may not add much marginal
performance improvement.