NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Donna K. Harman (ed.), National Institute of Standards and Technology

WORDIJ: A Word Pair Approach to Information Retrieval
J. Danowski

PIPELINE matched the query pairs against the pair file for each text file in approximately 16 milliseconds per file per 100 sets of query pairs. Running all 100 queries against the entire collection therefore took approximately five hours of PIPELINE processing on the word-pair index files, or about three minutes per query. Time constraints prevented us from computing word and word-pair document counts over the entire collection, so neither inverse document frequency nor entropy weighting of words and word pairs could be applied. Retrieved documents were ranked from 1 to 200 by the number of query pairs each document matched. The frequency with which a pair occurred in a document was not used as a weight, except to break ties at the 200-document rank cutoff. Time limitations also prevented full implementation of the indirect matching process, so only directly matching pairs were used in the main analysis that produced the results. Indirect matching was tested later; that test is described after the basic results are presented.

RESULTS

WORDij results were greater than or equal to the median level of performance on seven topics. Our results were within one standard deviation of the median on 55 topics, and within two standard deviations on 82 topics. Performance was significantly below the median on 14 topics, defined as topics whose results fell more than two standard deviations below the median. Table 1 lists the topics in two categories: those at or above the median, and those significantly below it.

Failure Analysis

Several kinds of failure analysis were performed.

Query Style.
To investigate whether stylistic features of queries were associated with performance, we used the shareware program PC-STYLE to compute the following variables for each query:

- Number of sentences
- Number of words
- Words per sentence
- Percentage of long words
- Percentage of personal words
- Percentage of action verbs
- Average number of syllables per word
- Reading grade level

Table 1: Topic Results Ordered by Performance

Topic  Difference (median - result)  Title

Better than or Equal to Median
 66    .08980                        Natural Language Processing
 29    .04540                        OS/2 problems
 94    .03180                        Computer-aided Crime
 95    .00800                        Computer-aided Crime Detection
 18    .00000                        Global Stock Market Trends
 44    .00000                        What Makes CASE succeed or fail
 88    .00000                        Crude Oil Price Trends
100    .00000                        Controlling High Tech Transfer
 50    .00250                        Virtual Reality Military Apps.

Significantly Below Median (Failures)
 22    .19590                        Legal Repercus. - Agrochemicals
 58    .20740                        Rail Strikes
 37    .21290                        Role of Minis and Mainframes
 20    .21770                        Superconductors
 77    .23290                        Poaching
 17    .24350                        Japanese Stock Market Trends
 93    .24560                        What Backing Does the NRA Have
 13    .24780                        Drug Approval
 54    .26840                        Satellite Launch Contracts
 51    .29490                        Airbus Subsidies
 10    .33340                        Space Program
 70    .35440                        Surrogate Motherhood
 78    .38240                        Greenpeace
 21    .48710                        Counternarcotics

These variables were correlated with a criterion variable: the difference between the median result and our result. For each query, we subtracted our 11-point average of recall-precision from the median 11-point average across systems, as reported in the official results for the test queries 51-100. Table 2 displays these correlations; none is statistically significant at the .01 level.

A second criterion variable represented whether the query fell in the "failed" category, that is, more than two standard deviations below the median. This dummy variable was coded zero for success and one for failure.
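The two criterion variables and the correlations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pearson_r` and `criteria` are hypothetical, the per-topic standard deviations are assumed to be available alongside the medians, and the five-query data set is invented purely to exercise the functions. Because the failure criterion is a 0/1 dummy, its Pearson correlation with a style variable is the point-biserial correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def criteria(results, medians, sds):
    """Build both criterion variables for each query (hypothetical helper).

    difference: median minus our 11-point average (positive = below median).
    failed:     1 if our result is more than two standard deviations below
                the median, else 0 (the "failed" dummy variable).
    """
    difference = [m - r for r, m in zip(results, medians)]
    failed = [1 if r < m - 2 * s else 0 for r, m, s in zip(results, medians, sds)]
    return difference, failed


# Hypothetical values for five queries; NOT the paper's data.
words_per_sentence = [12.0, 18.5, 9.0, 22.0, 15.0]  # one PC-STYLE-like variable
our_result = [0.20, 0.05, 0.31, 0.10, 0.18]         # our 11-point averages
median = [0.22, 0.25, 0.30, 0.20, 0.18]             # median across systems
sd = [0.05, 0.08, 0.06, 0.04, 0.07]                 # assumed per-topic SDs

difference, failed = criteria(our_result, median, sd)
print("failed dummy:", failed)
print("r with difference:", round(pearson_r(words_per_sentence, difference), 3))
print("r with failure (point-biserial):", round(pearson_r(words_per_sentence, failed), 3))
```

Significance at the .01 level would then be judged from r and the number of queries, for example with a t-test on r with n - 2 degrees of freedom; that step is left out here because it needs a t-distribution table or a statistics library.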
The style variables were also correlated with the failure criterion; again, no correlation was significant at the .01 level. This suggests that query length, complexity, and the other stylistic variables measured are unrelated to retrieval performance.

Query Words.