NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Proximity-Correlation for Document Ranking: The PARA Group's TREC Experiment
M. Zimmerman
National Institute of Standards and Technology
Donna K. Harman
the TREC topic statements plus a few obvious synonyms which occurred to me as I typed in the queries. I
converted all characters in the TREC document set to uppercase before processing, so my regexps ignored
case issues.
To handle the proximity (boolean-AND-like) requirement among separate terms, and to generate
estimated relevances, I invented a very simple scoring system. Every time a line in a document matched
one of my regexps, I added an arbitrary 5 points to that regexp's score. A line in the document got a score
equal to the product of the regexp scores for that query. When moving on to the next line of a document, I
multiplied all regexp scores by 0.9, to make them fade away with a characteristic length scale of 10
lines (= 1/(1-0.9)). The estimated relevance for a document as a whole was the maximum relevance of
any line in that document.
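The scoring scheme just described can be sketched as follows (a reconstruction in Python, not the original Gawk code; function and variable names are illustrative):

```python
import re

def score_document(lines, regexps, points=5.0, decay=0.9):
    """Fading proximity score: each regexp match on a line adds
    `points` to that regexp's running score; a line's relevance is
    the product of all running scores (so a term never seen zeroes
    the product, giving the AND-like behavior); all scores decay
    by `decay` when moving to the next line."""
    scores = [0.0] * len(regexps)
    best = 0.0
    for line in lines:
        for i, rx in enumerate(regexps):
            if rx.search(line):
                scores[i] += points
        rel = 1.0
        for s in scores:
            rel *= s
        best = max(best, rel)                 # document score = best line
        scores = [s * decay for s in scores]  # fade before the next line
    return best
```

With the 0.9 decay, two terms ten lines apart still earn roughly a third of the score they would earn on the same line.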
For terms which were to be negatively weighted (boolean-NOT-like), as in TREC Topic 026, instead of
multiplying in the regexp's score to get a line's relevance, I multiplied in 5 minus that score. I
experimented briefly with different weights and relevance-length-scales for different terms, but
decided that there was neither sufficient time nor much benefit to be gained that way, and so I settled
on the 5-points-per-instance, 0.9-degradation-rate standard.
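Under the same illustrative assumptions, the negated-term variant only changes the per-term factor in the line's product:

```python
import re

def score_with_negation(lines, regexps, negated, points=5.0, decay=0.9):
    """As in the basic scheme, but for regexps flagged in `negated`
    the line relevance multiplies in (points - score) rather than
    score, so recent matches of an unwanted term suppress the line."""
    scores = [0.0] * len(regexps)
    best = 0.0
    for line in lines:
        for i, rx in enumerate(regexps):
            if rx.search(line):
                scores[i] += points
        rel = 1.0
        for i, s in enumerate(scores):
            rel *= (points - s) if negated[i] else s
        best = max(best, rel)
        scores = [s * decay for s in scores]
    return best
```

Note that an absent negated term contributes its full (5 - 0) credit, while a fresh match of it zeroes the line.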
Implementation
I performed my TREC experiments over a period of a few weeks in background on a NeXT workstation
(an old Cube), using the (rather slow) built-in writeable optical disk to hold data copied from the
TREC CD-ROMs. On the average, each TREC query took about one-third of a second per document to
execute. My implementation ran 10 queries at a time and generated a line of output for each document,
listing the document ID followed by (the integer part of) its estimated relevance, in 10 columns. I then
used additional UNIX shell scripts containing comm and sort and head to tabulate the 200 highest-
ranked documents for each topic. A final Gawk program converted my results into the standard TREC
format. The total wall-clock computation time which I used was about 11 days for each CD-ROM of
data. Late in the process of scoring documents, I discovered that a bug in my programs caused the last
document in a data set not to be ranked - but I had no time to fix the error, and it probably did not
affect my overall results significantly.
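The sort-and-head part of such a tabulation might look like the following (the file names and the two-column ID/score layout are assumptions for illustration, not the original scripts):

```shell
# scores.txt: one "DOCID score" line per document for one topic (assumed layout)
# sort numerically by the score column, descending, and keep the top 200
sort -k2,2nr scores.txt | head -n 200 > top200.txt
```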
Further Work
I believe that the proximity-correlation approach used in my TREC experiment has promise for other
relevance-ranking tasks. Almost certainly, the use of a compiled language (and perhaps a simpler,
faster pattern-matching facility) rather than Gawk would result in a speed increase of an order of
magnitude or more. Much higher speeds should be achievable by inverted-index methods. User
feedback might be valuable in modifying the default settings for term weights and relevance-lengths.
A thesaurus option to automatically generate synonyms could help automate the query-creation process.
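For comparison, the inverted-index idea mentioned above amounts to precomputing a term-to-documents map, so that a query touches only candidate documents rather than scanning every line of every document (a generic sketch, not part of the original experiment):

```python
from collections import defaultdict

def build_index(docs):
    """Map each whitespace-delimited term (uppercased, matching the
    experiment's case-folding) to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.upper().split():
            index[term].add(doc_id)
    return index

def candidates(index, terms):
    """Documents containing every query term (boolean-AND semantics);
    only these would need the line-by-line proximity scoring pass."""
    sets = [index.get(t.upper(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```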